[GIT,PULL] Please pull NFS client changes for Linux 5.18

Message ID 7b0576d8d49bbd358767620218c1966e8fe9a883.camel@hammerspace.com


Trond Myklebust March 29, 2022, 7:36 p.m. UTC
Hi Linus,

The following changes since commit 53ab78cd6d5aba25575a7cfb95729336ba9497d8:

  Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux (2022-02-24 17:35:22 -0800)

are available in the Git repository at:

  git://git.linux-nfs.org/projects/trondmy/linux-nfs.git tags/nfs-for-5.18-1

for you to fetch changes up to 7c9d845f0612e5bcd23456a2ec43be8ac43458f1:

  NFSv4/pNFS: Fix another issue with a list iterator pointing to the head (2022-03-28 08:36:34 -0400)

Please note that Stephen Rothwell has reported a conflict between
this branch and upstream:

On Tue, 15 Mar 2022 20:45:40 +1100 Stephen Rothwell <sfr@canb.auug.org.au> wrote:
>
> Today's linux-next merge of the folio tree got a conflict in:
>
>   fs/nfs/file.c
>
> between commit:
>
>   8786fde8421c ("Convert NFS from readpages to readahead")
>
> from the nfs tree and commit:
>
>   821405cf3ebb ("fs: Convert trivial uses of __set_page_dirty_nobuffers to filemap_dirty_folio")
>
> from the folio tree.

The fix should be trivial.

Thanks,
  Trond

----------------------------------------------------------------
NFS client updates for Linux 5.18

Highlights include:

Features:
- Switch NFS to use readahead instead of the obsolete readpages.
- Readdir fixes to improve cacheability of large directories when there
  are multiple readers and writers.
- Readdir performance improvements when doing a seekdir() immediately
  after opening the directory (common when re-exporting NFS).
- NFS swap improvements from Neil Brown.
- Loosen up memory allocation to permit direct reclaim and write back
  in cases where there is no danger of deadlocking the writeback code or
  NFS swap.
- Avoid sillyrename when the NFSv4 server claims to support the
  necessary features to recover the unlinked but open file after reboot.

Bugfixes:
- Patch from Olga to add a mount option to control NFSv4.1 session
  trunking discovery, and default it to being off.
- Fix a lockup in nfs_do_recoalesce().
- Two fixes for list iterator variables being used when pointing to the
  list head.
- Fix a kernel memory scribble when reading from a non-socket transport
  in /sys/kernel/sunrpc.
- Fix a race where reconnecting to a server could leave the TCP socket
  stuck forever in the connecting state.
- Patch from Neil to fix a shutdown race which can leave the SUNRPC
  transport timer primed after we free the struct xprt itself.
- Patch from Xin Xiong to fix reference count leaks in the NFSv4.2 copy
  offload.
- Sunrpc patch from Olga to avoid resending a task on an offlined
  transport.

Cleanups:
- Patches from Dave Wysochanski to clean up the fscache code

----------------------------------------------------------------
Alexey Khoroshilov (1):
      NFS: remove unneeded check in decode_devicenotify_args()

Benjamin Coddington (1):
      NFSv4: use unique client identifiers in network namespaces

Colin Ian King (1):
      SUNRPC: remove redundant pointer plainhdr

Dave Wysochanski (4):
      NFS: Cleanup usage of nfs_inode in fscache interface
      NFS: Rename fscache read and write pages functions
      NFS: Replace dfprintks with tracepoints in fscache read and write page functions
      NFS: Remove remaining dfprintks related to fscache and remove NFSDBG_FSCACHE

Jakob Koschel (1):
      NFS: replace usage of found with dedicated list iterator variable

Matthew Wilcox (Oracle) (1):
      Convert NFS from readpages to readahead

NeilBrown (12):
      NFS: remove IS_SWAPFILE hack
      SUNRPC/call_alloc: async tasks mustn't block waiting for memory
      SUNRPC/auth: async tasks mustn't block waiting for memory
      SUNRPC/xprt: async tasks mustn't block waiting for memory
      SUNRPC: remove scheduling boost for "SWAPPER" tasks.
      NFS: discard NFS_RPC_SWAPFLAGS and RPC_TASK_ROOTCREDS
      SUNRPC: improve 'swap' handling: scheduling and PF_MEMALLOC
      NFSv4: keep state manager thread active if swap is enabled
      NFS: swap IO handling is slightly different for O_DIRECT IO
      NFS: swap-out must always use STABLE writes.
      SUNRPC: change locking for xs_swap_enable/disable
      SUNRPC: avoid race between mod_timer() and del_timer_sync()

Olga Kornievskaia (5):
      NFSv4.1 support for NFS4_RESULT_PRESERVER_UNLINKED
      NFSv4.1 restrict GETATTR fs_location query to the main transport
      NFSv4.1 provide mount option to toggle trunking discovery
      SUNRPC don't resend a task on an offlined transport
      NFSv4.1: don't retry BIND_CONN_TO_SESSION on session error

Tom Rix (1):
      NFS: simplify check for freeing cn_resp

Trond Myklebust (63):
      NFSv4: Protect the state recovery thread against direct reclaim
      NFS: Charge open/lock file contexts to kmemcg
      NFSv4: Charge NFSv4 open state trackers to kmemcg
      NFSv4.2: Fix up an invalid combination of memory allocation flags
      NFS: Convert GFP_NOFS to GFP_KERNEL
      NFSv4/flexfiles: Convert GFP_NOFS to GFP_KERNEL
      NFSv4.2/copyoffload: Convert GFP_NOFS to GFP_KERNEL
      SUNRPC: Convert GFP_NOFS to GFP_KERNEL
      SUNRPC/auth_gss: Convert GFP_NOFS to GFP_KERNEL
      SUNRPC/xprtrdma: Convert GFP_NOFS to GFP_KERNEL
      NFS: Replace last uses of NFS_INO_REVAL_PAGECACHE
      NFS: Remove unused flag NFS_INO_REVAL_PAGECACHE
      NFS: NFSv2/v3 clients should never be setting NFS_CAP_XATTR
      NFS: Remove unnecessary XATTR cache invalidation in nfs_fhget()
      NFS: Clean up NFSv4.2 xattrs
      NFS: Use of mapping_set_error() results in spurious errors
      Revert "NFSv4: use unique client identifiers in network namespaces"
      NFS: Return valid errors from nfs2/3_decode_dirent()
      NFS: constify nfs_server_capable() and nfs_have_writebacks()
      NFS: Trace lookup revalidation failure
      NFS: Initialise the readdir verifier as best we can in nfs_opendir()
      NFS: Use kzalloc() to avoid initialising the nfs_open_dir_context
      NFS: Calculate page offsets algorithmically
      NFS: Store the change attribute in the directory page cache
      NFS: Don't re-read the entire page cache to find the next cookie
      NFS: Don't advance the page pointer unless the page is full
      NFS: Adjust the amount of readahead performed by NFS readdir
      NFS: If the cookie verifier changes, we must invalidate the page cache
      NFS: Simplify nfs_readdir_xdr_to_array()
      NFS: Reduce use of uncached readdir
      NFS: Improve heuristic for readdirplus
      NFS: Don't ask for readdirplus unless it can help nfs_getattr()
      NFSv4: Ask for a full XDR buffer of readdir goodness
      NFS: Readdirplus can't help lookup for case insensitive filesystems
      NFS: Don't request readdirplus when revalidation was forced
      NFS: Add basic readdir tracing
      NFS: Trace effects of readdirplus on the dcache
      NFS: Trace effects of the readdirplus heuristic
      NFS: Clean up page array initialisation/free
      NFS: Convert readdir page cache to use a cookie based index
      NFS: Fix up forced readdirplus
      NFS: Optimise away the previous cookie field
      NFS: Cache all entries in the readdirplus reply
      NFS: Don't deadlock when cookie hashes collide
      NFS: Fix revalidation of empty readdir pages
      SUNRPC: Don't call connect() more than once on a TCP socket
      SUNRPC: Only save the TCP source port after the connection is complete
      SUNRPC: Fix socket waits for write buffer space
      SUNRPC: Replace internal use of SOCKWQ_ASYNC_NOSPACE
      SUNRPC: Improve accuracy of socket ENOBUFS determination
      NFS: Fix memory allocation in rpc_malloc()
      NFS: Fix memory allocation in rpc_alloc_task()
      SUNRPC: Fix unx_lookup_cred() allocation
      SUNRPC: Make the rpciod and xprtiod slab allocation modes consistent
      NFS: nfsiod should not block forever in mempool_alloc()
      NFS: Avoid writeback threads getting stuck in mempool_alloc()
      NFSv4/pnfs: Ensure pNFS allocation modes are consistent with nfsiod
      pNFS/flexfiles: Ensure pNFS allocation modes are consistent with nfsiod
      pNFS/files: Ensure pNFS allocation modes are consistent with nfsiod
      SUNRPC: Do not dereference non-socket transports in sysfs
      SUNRPC: Don't return error values in sysfs read of closed files
      NFS: Don't loop forever in nfs_do_recoalesce()
      NFSv4/pNFS: Fix another issue with a list iterator pointing to the head

Xin Xiong (1):
      NFSv4.2: fix reference count leaks in _nfs42_proc_copy_notify()

 fs/nfs/Kconfig                          |   4 +
 fs/nfs/callback_proc.c                  |  29 +-
 fs/nfs/callback_xdr.c                   |   4 -
 fs/nfs/client.c                         |   3 +-
 fs/nfs/delegation.c                     |   2 +-
 fs/nfs/dir.c                            | 626 +++++++++++++++++++-------------
 fs/nfs/direct.c                         |  48 ++-
 fs/nfs/file.c                           |  26 +-
 fs/nfs/filelayout/filelayout.c          |   2 +-
 fs/nfs/flexfilelayout/flexfilelayout.c  |  53 ++-
 fs/nfs/fs_context.c                     |   8 +
 fs/nfs/fscache.c                        |  53 +--
 fs/nfs/fscache.h                        |  45 +--
 fs/nfs/inode.c                          |  86 ++---
 fs/nfs/internal.h                       |  25 +-
 fs/nfs/nfs2xdr.c                        |   3 +-
 fs/nfs/nfs3xdr.c                        |  30 +-
 fs/nfs/nfs42proc.c                      |  34 +-
 fs/nfs/nfs42xattr.c                     |   7 +-
 fs/nfs/nfs4_fs.h                        |   1 +
 fs/nfs/nfs4file.c                       |   8 +-
 fs/nfs/nfs4proc.c                       |  62 +++-
 fs/nfs/nfs4state.c                      |  59 ++-
 fs/nfs/nfs4xdr.c                        |   7 +-
 fs/nfs/nfstrace.h                       | 221 ++++++++++-
 fs/nfs/pagelist.c                       |  11 +-
 fs/nfs/pnfs.c                           |  50 +--
 fs/nfs/pnfs.h                           |   2 +
 fs/nfs/pnfs_nfs.c                       |   8 +-
 fs/nfs/proc.c                           |   1 +
 fs/nfs/read.c                           |  29 +-
 fs/nfs/write.c                          |  43 ++-
 include/linux/nfs_fs.h                  |  45 +--
 include/linux/nfs_fs_sb.h               |   1 +
 include/linux/nfs_xdr.h                 |   5 +-
 include/linux/sunrpc/auth.h             |   1 +
 include/linux/sunrpc/sched.h            |   2 +-
 include/linux/sunrpc/xprt.h             |   3 +
 include/linux/sunrpc/xprtsock.h         |   3 +-
 include/trace/events/sunrpc.h           |   1 -
 include/uapi/linux/nfs4.h               |   1 +
 include/uapi/linux/nfs_fs.h             |   2 +-
 net/sunrpc/auth.c                       |   8 +-
 net/sunrpc/auth_gss/auth_gss.c          |  26 +-
 net/sunrpc/auth_gss/auth_gss_internal.h |   2 +-
 net/sunrpc/auth_gss/gss_krb5_crypto.c   |  10 +-
 net/sunrpc/auth_gss/gss_krb5_seqnum.c   |   4 +-
 net/sunrpc/auth_gss/gss_krb5_wrap.c     |   4 +-
 net/sunrpc/auth_unix.c                  |  16 +-
 net/sunrpc/backchannel_rqst.c           |   8 +-
 net/sunrpc/clnt.c                       |  13 +-
 net/sunrpc/rpcb_clnt.c                  |   4 +-
 net/sunrpc/sched.c                      |  56 ++-
 net/sunrpc/socklib.c                    |   3 +-
 net/sunrpc/sysfs.c                      |  76 ++--
 net/sunrpc/xprt.c                       |  23 +-
 net/sunrpc/xprtrdma/frwr_ops.c          |   2 +-
 net/sunrpc/xprtrdma/transport.c         |  10 +-
 net/sunrpc/xprtrdma/verbs.c             |   4 +-
 net/sunrpc/xprtsock.c                   | 207 ++++++-----
 60 files changed, 1317 insertions(+), 813 deletions(-)

Comments

pr-tracker-bot@kernel.org March 30, 2022, 2 a.m. UTC | #1
The pull request you sent on Tue, 29 Mar 2022 19:36:00 +0000:

> git://git.linux-nfs.org/projects/trondmy/linux-nfs.git tags/nfs-for-5.18-1

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/965181d7ef7e1a863477536dc328c23a7ebc8a1d

Thank you!
Linus Torvalds March 30, 2022, 6:17 p.m. UTC | #2
On Tue, Mar 29, 2022 at 12:36 PM Trond Myklebust
<trondmy@hammerspace.com> wrote:
>
> - Readdir fixes to improve cacheability of large directories when there
>   are multiple readers and writers.

So I only took a look at this part now. I've obviously already pulled
it, but that use of 'xxhash()' just made me go "Whaaa?"

It's claimed that it's used because of its extreme performance, but
the performance numbers come from using it as a block hash.

That's not what nfs does.

The nfs code just does

    xxhash(&cookie, sizeof(cookie), 0) & NFS_READDIR_COOKIE_MASK;

where that "cookie" is just a 64-bit entity. And then it masks off
everything but 18 bits.

Is that *really* appropriate use of a new hash function?

Why is this not just doing

  #include <linux/hash.h>

  hash_64(cookie, 18);

which is a lot more obvious than xxhash().

If there really are some serious problems with the perfectly standard
hash() functionality, I think you should document them.

Because just randomly picking xxhash() without explaining _why_ you
can't just use the same simple thing we use elsewhere is very odd.

Or rather, when the only documentation is "performance", then I think
the regular "hash_64()" is the obvious and trivial choice.

                  Linus
Trond Myklebust March 30, 2022, 8:01 p.m. UTC | #3
On Wed, 2022-03-30 at 11:17 -0700, Linus Torvalds wrote:
> On Tue, Mar 29, 2022 at 12:36 PM Trond Myklebust
> <trondmy@hammerspace.com> wrote:
> > 
> > - Readdir fixes to improve cacheability of large directories when
> > there
> >   are multiple readers and writers.
> 
> So I only took a look at this part now. I've obviously already pulled
> it, but that use of 'xxhash()' just made me go "Whaaa?"
> 
> It's claimed that it's used because of its extreme performance, but
> the performance numbers come from using it as a block hash.
> 
> That's not what nfs does.
> 
> The nfs code just does
> 
>     xxhash(&cookie, sizeof(cookie), 0) & NFS_READDIR_COOKIE_MASK;
> 
> where that "cookie" is just a 64-bit entity. And then it masks off
> everything but 18 bits.
> 
> Is that *really* appropriate use of a new hash function?
> 
> Why is this not just doing
> 
>   #include <linux/hash.h>
> 
>   hash_64(cookie, 18);
> 
> which is a lot more obvious than xxhash().
> 
> If there really are some serious problems with the perfectly standard
> hash() functionality, I think you should document them.
> 
> Because just randomly picking xxhash() without explaining _why_ you
> can't just use the same simple thing we use elsewhere is very odd.
> 
> Or rather, when the only documentation is "performance", then I think
> the regular "hash_64()" is the obvious and trivial choice.
> 
>                   Linus

My main worry was that hash_64() would have too many collisions. Since
this is using cuckoo nesting, that would be a problem.

I did some quick studies after I got your email, and it seems as if my
concerns were unfounded. I've tested both a linear index and a sample
of ext4 getdents offsets.
While the sample of ext4 offsets did show a larger number of collisions
than a simple linear index, it wasn't too terrible (3 collisions in a
sample of 9000 entries).
The linear index showed no collisions up to about 100000 entries.

So, I'd be OK with changing to hash_64() and will queue up a bugfix for
it. I should have done this study earlier...
Linus Torvalds March 30, 2022, 8:11 p.m. UTC | #4
On Wed, Mar 30, 2022 at 1:02 PM Trond Myklebust <trondmy@hammerspace.com> wrote:
>
> My main worry was that hash_64() would have too many collisions.

So as you found out, hash_64() is actually a fairly good hash despite
being very simple.

In fact, with an input of just one word, it's generally hard to do
much better. I'm obviously not claiming it's a _cryptographic_ hash,
but compared to the usual "xor and shift a few times", it really
shouldn't be too bad.

And that's really what xxhash() does anyway.

I think the reason to use xxhash() would be if you have a noticeably
bigger input, or some truly special cases.

But for a single word, and then not even using very many of the
resulting bits, our regular family of 'hash()' routines really should
generally be perfectly fine. It should spread and mix the input bits
competently.

            Linus
Trond Myklebust March 30, 2022, 10:22 p.m. UTC | #5
On Wed, 2022-03-30 at 16:01 -0400, Trond Myklebust wrote:
> On Wed, 2022-03-30 at 11:17 -0700, Linus Torvalds wrote:
> > On Tue, Mar 29, 2022 at 12:36 PM Trond Myklebust
> > <trondmy@hammerspace.com> wrote:
> > > 
> > > - Readdir fixes to improve cacheability of large directories when
> > > there
> > >   are multiple readers and writers.
> > 
> > So I only took a look at this part now. I've obviously already
> > pulled
> > it, but that use of 'xxhash()' just made me go "Whaaa?"
> > 
> > It's claimed that it's used because of its extreme performance, but
> > the performance numbers come from using it as a block hash.
> > 
> > That's not what nfs does.
> > 
> > The nfs code just does
> > 
> >     xxhash(&cookie, sizeof(cookie), 0) & NFS_READDIR_COOKIE_MASK;
> > 
> > where that "cookie" is just a 64-bit entity. And then it masks off
> > everything but 18 bits.
> > 
> > Is that *really* appropriate use of a new hash function?
> > 
> > Why is this not just doing
> > 
> >   #include <linux/hash.h>
> > 
> >   hash_64(cookie, 18);
> > 
> > which is a lot more obvious than xxhash().
> > 
> > If there really are some serious problems with the perfectly
> > standard
> > hash() functionality, I think you should document them.
> > 
> > Because just randomly picking xxhash() without explaining _why_ you
> > can't just use the same simple thing we use elsewhere is very odd.
> > 
> > Or rather, when the only documentation is "performance", then I
> > think
> > the regular "hash_64()" is the obvious and trivial choice.
> > 
> >                   Linus
> 
> My main worry was that hash_64() would have too many collisions.
> Since
> this is using cuckoo nesting, that would be a problem.
> 
> I did some quick studies after I got your email, and it seems as if
> my
> concerns were unfounded. I've tested both a linear index and a sample
> of ext4 getdents offsets.
> While the sample of ext4 offsets did show a larger number of
> collisions
> than a simple linear index, it wasn't too terrible (3 collisions in a
> sample of 9000 entries).

Actually, let me correct that.

With 9175 ext4 offsets, I see 157 collisions (== hash buckets with > 1
entry). So hash_64() does perform less well when you're hashing a value
that is already a hash.

> The linear index showed no collisions up to about 100000 entries.

This is unchanged, so with XFS and btrfs as the exported filesystems,
we should not have a problem.

> 
> So, I'd be OK with changing to hash_64() and will queue up a bugfix
> for
> it. I should have done this study earlier...
>
Linus Torvalds March 30, 2022, 10:45 p.m. UTC | #6
On Wed, Mar 30, 2022 at 3:22 PM Trond Myklebust <trondmy@hammerspace.com> wrote:
>
> With 9175 ext4 offsets, I see 157 collisions (== hash buckets with > 1
> entry). So hash_64() does perform less well when you're hashing a value
> that is already a hash.

No collisions with xxhash? Because xxhash() really seems to do pretty
similar things in the end (multiply by a prime, shift bits down and
xor them).

In fact, the main difference seems to be that xxhash() will do a
"rotl()" by 27 before doing the prime multiplication, and then it will
finish the thing by a few more multiples mixed with shifting the high
bits down a few times.

Our regular fast hash doesn't do the "shift bits down", because it
relies on only using the upper bits anyway (and it is pretty heavily
geared towards "fast and good enough").

But if the *source* of the hash has a lot of low bits clear, I can
imagine that the "rotl" that xxhash does improves on the bit
distribution of the multiplication (which will only move bits
upwards).

And if it turns out our default hash has some bad cases, I'd prefer to
fix _that_ regardless..

             Linus
Trond Myklebust March 30, 2022, 11:18 p.m. UTC | #7
On Wed, 2022-03-30 at 15:45 -0700, Linus Torvalds wrote:
> On Wed, Mar 30, 2022 at 3:22 PM Trond Myklebust
> <trondmy@hammerspace.com> wrote:
> > 
> > With 9175 ext4 offsets, I see 157 collisions (== hash buckets with
> > > 1
> > entry). So hash_64() does perform less well when you're hashing a
> > value
> > that is already a hash.
> 
> No collisions with xxhash? Because xxhash() really seems to do
> pretty
> similar things in the end (multiply by a prime, shift bits down and
> xor them).
> 
> In fact, the main difference seems to be that xxhash() will do a
> "rotl()" by 27 before doing the prime multiplication, and then it
> will
> finish the thing by a few more multiples mixed with shifting the high
> bits down a few times.
> 
> Our regular fast hash doesn't do the "shift bits down", because it
> relies on only using the upper bits anyway (and it is pretty heavily
> geared towards "fast and good enough").
> 
> But if the *source* of the hash has a lot of low bits clear, I can
> imagine that the "rotl" that xxhash does improves on the bit
> distribution of the multiplication (which will only move bits
> upwards).
> 
> And if it turns out our default hash has some bad cases, I'd prefer
> to
> fix _that_ regardless..
> 

Hmm... No, there doesn't appear to be a huge difference between the two.
With both test programs running on the same data set of ext4 getdents
offsets, I see the following.

With xxhash64 reduced to 18 bits, I see:
read 57654 entries
min = 0, max = 5, collisions = 5501, avg = 1
read 98978 entries
min = 0, max = 6, collisions = 14730, avg = 1

..and with hash_64() reduced to 18 bits:
read 57654 entries
min = 0, max = 4, collisions = 5538, avg = 1
read 98978 entries
min = 0, max = 5, collisions = 14623, avg = 1

So they both appear to see similar collision rates with the same data
sets.
Linus Torvalds March 31, 2022, 12:52 a.m. UTC | #8
On Wed, Mar 30, 2022 at 4:18 PM Trond Myklebust <trondmy@hammerspace.com> wrote:
>
> Hmm... No, there doesn't appear to be a huge difference between the two.

Ok, thanks for checking. It's basically what I would have expected,
both models really just depend on a reasonable mixing and shuffling of
bits in the word, and it really looks like they have very similar
collision behavior.

At some point, even with equivalent functions, you obviously end up with
random differences that just depend on the input set and pure luck.

But at least based on your numbers, it does look like they really are
equivalent, and hash_64() certainly doesn't look any worse.

All the extra work xxhash() does should matter mainly for

 (a) using more final bits of the hash (i.e. not reducing them in the end)

 (b) all the cases where you have much more input than one single word

Here, (b) obviously isn't an issue.

And that (a) is by design - those default kernel hash() functions are
literally designed for generating the index for hash tables, and so
expects that final reduction in size.

As a result, unlike xxhash(), it doesn't try to mix in bits in the low
bits, because it knows those will be shifted away in the use-case it's
designed for (the hash() reduction in bits always takes the high
bits).

But that use-case is really exactly what nfs is looking for too, which
is why I was expecting the regular hash_64() to JustWork(tm) for the
nfs case.

                 Linus