mbox series

[GIT,PULL] Ceph updates for 5.20-rc1

Message ID 20220811152446.281723-1-idryomov@gmail.com (mailing list archive)
State New, archived
Headers show
Series [GIT,PULL] Ceph updates for 5.20-rc1 | expand

Pull-request

https://github.com/ceph/ceph-client.git tags/ceph-for-5.20-rc1

Message

Ilya Dryomov Aug. 11, 2022, 3:24 p.m. UTC
Hi Linus,

The following changes since commit 3d7cb6b04c3f3115719235cc6866b10326de34cd:

  Linux 5.19 (2022-07-31 14:03:01 -0700)

are available in the Git repository at:

  https://github.com/ceph/ceph-client.git tags/ceph-for-5.20-rc1

for you to fetch changes up to a8af0d682ae0c9cf62dd0ad6afdb1480951d6a10:

  libceph: clean up ceph_osdc_start_request prototype (2022-08-03 14:05:39 +0200)

----------------------------------------------------------------
We have a good pile of various fixes and cleanups from Xiubo, Jeff,
Luis and others, almost exclusively in the filesystem.  Several patches
touch files outside of our normal purview to set the stage for bringing
in Jeff's long awaited ceph+fscrypt series in the near future.  All of
them have appropriate acks and sat in linux-next for a while.

----------------------------------------------------------------
Daichi Mukai (1):
      libceph: print fsid and epoch with osd id

Hu Weiwen (1):
      ceph: don't truncate file in atomic_open

Jason Wang (1):
      libceph: fix ceph_pagelist_reserve() comment typo

Jeff Layton (8):
      fs: change test in inode_insert5 for adding to the sb list
      fscrypt: export fscrypt_fname_encrypt and fscrypt_fname_encrypted_size
      fscrypt: add fscrypt_context_for_new_inode
      ceph: don't leak snap_rwsem in handle_cap_grant
      ceph: convert to generic_file_llseek
      ceph: call netfs_subreq_terminated with was_async == false
      ceph: switch back to testing for NULL folio->private in ceph_dirty_folio
      libceph: clean up ceph_osdc_start_request prototype

Li Qiong (1):
      libceph: check pointer before assigned to "c->rules[]"

Luis Henriques (2):
      ceph: use correct index when encoding client supported features
      ceph: prevent a client from exceeding the MDS maximum xattr size

Xiubo Li (13):
      ceph: remove useless CEPHFS_FEATURES_CLIENT_REQUIRED
      fs/dcache: export d_same_name() helper
      ceph: wait for the first reply of inflight async unlink
      ceph: add session already open notify support
      ceph: choose auth MDS for getxattr with the Xs caps
      ceph: fix the incorrect comment for the ceph_mds_caps struct
      ceph: fix incorrect old_size length in ceph_mds_request_args
      ceph: make change_auth_cap_ses a global symbol
      ceph: update the auth cap when the async create req is forwarded
      ceph: don't get the inline data for new creating files
      ceph: flush the dirty caps immediatelly when quota is approaching
      ceph: make f_bsize always equal to f_frsize
      ceph: remove useless check for the folio

 drivers/block/rbd.c             |   6 +-
 fs/ceph/addr.c                  |  59 ++++++--------
 fs/ceph/caps.c                  |  38 ++++-----
 fs/ceph/dir.c                   |  79 ++++++++++++++++---
 fs/ceph/file.c                  | 123 ++++++++++++------------------
 fs/ceph/inode.c                 |  13 +++-
 fs/ceph/mds_client.c            | 165 ++++++++++++++++++++++++++++++++++++++--
 fs/ceph/mds_client.h            |  13 ++--
 fs/ceph/mdsmap.c                |  22 +++++-
 fs/ceph/super.c                 |  19 +++--
 fs/ceph/super.h                 |  31 ++++++--
 fs/ceph/xattr.c                 |  12 ++-
 fs/crypto/fname.c               |  36 +++++++--
 fs/crypto/fscrypt_private.h     |   9 +--
 fs/crypto/hooks.c               |   6 +-
 fs/crypto/policy.c              |  35 +++++++--
 fs/dcache.c                     |  15 +++-
 fs/inode.c                      |  10 ++-
 include/linux/ceph/ceph_fs.h    |   8 +-
 include/linux/ceph/mdsmap.h     |   1 +
 include/linux/ceph/osd_client.h |   5 +-
 include/linux/dcache.h          |   2 +
 include/linux/fscrypt.h         |   5 ++
 include/linux/mmdebug.h         |  10 +++
 net/ceph/osd_client.c           |  15 ++--
 net/ceph/osdmap.c               |  32 ++++++--
 net/ceph/pagelist.c             |   2 +-
 27 files changed, 538 insertions(+), 233 deletions(-)

Comments

Linus Torvalds Aug. 11, 2022, 8:03 p.m. UTC | #1
On Thu, Aug 11, 2022 at 8:25 AM Ilya Dryomov <idryomov@gmail.com> wrote:
>
>    [..] Several patches
> touch files outside of our normal purview to set the stage for bringing
> in Jeff's long awaited ceph+fscrypt series in the near future.  All of
> them have appropriate acks and sat in linux-next for a while.

What? No.

I'm looking at the fs/dcache.c change, for example, and don't see the
relevant maintainers having acked it at all.

The mmdebug.h file similarly seems to not have the actual maintainer
acks, and seems just plain stupid (why does it add that new folio
warning macro, when the existing folio warning macro
VM_WARN_ON_ONCE_FOLIO() is *better*?

Those are some very core files, and while the changes seem harmless,
they sure don't seem obviously ok.

What's the point of warning about bogus folios more than once? That's
a debug warning - if it hits even once, that's already "uhhuh,
something is bad". Showing the warning more than once is likely just
going to cause more problems, not give you more information.

And did somebody verify that d_same_name() is still inlined in the
place that truly *matters*?  Because from my quick test, that patch
broke it. Now __d_lookup() does a function call.

And I _suspect_ it's all ok, because it turns out that
__d_lookup_rcu() is the *really* hot case, and that one has inlined it
all manually.

But this kind of "we touch some *truly* core functionality, without
the acks from the maintainers, and then we *claim* to have relevant
acks" is really not even remotely ok.

I've pulled this because I suspect that d_same_name() thing is fine,
and I think the VM_WARN_ON_FOLIO() addition is completely wrong but
not horrendous.

But you're on my tentative shit-list just for having claimed to have
appropriate acks and having been found wanting.

Just for your information: fs/dcache.c is some of the most optimized
code in the kernel, and some of the subtlest. That RCU pathname lookup
is serious business. You don't make changes to pathname lookup just
willy nilly. There's a reason I start looking at individual patches
when I see it in the diffstat.

And please just get rid of VM_WARN_ON_FOLIO() again, and use the
VM_WARN_ON_ONCE_FOLIO() version. Because if you expect to have
multiple, then you probably shouldn't have that warning at ALL.

                 Linus
Ilya Dryomov Aug. 11, 2022, 8:55 p.m. UTC | #2
On Thu, Aug 11, 2022 at 10:04 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, Aug 11, 2022 at 8:25 AM Ilya Dryomov <idryomov@gmail.com> wrote:
> >
> >    [..] Several patches
> > touch files outside of our normal purview to set the stage for bringing
> > in Jeff's long awaited ceph+fscrypt series in the near future.  All of
> > them have appropriate acks and sat in linux-next for a while.
>
> What? No.
>
> I'm looking at the fs/dcache.c change, for example, and don't see the
> relevant maintainers having acked it at all.
>
> The mmdebug.h file similarly seems to not have the actual maintainer
> acks, and seems just plain stupid (why does it add that new folio
> warning macro, when the existing folio warning macro
> VM_WARN_ON_ONCE_FOLIO() is *better*?
>
> Those are some very core files, and while the changes seem harmless,
> they sure don't seem obviously ok.
>
> What's the point of warning about bogus folios more than once? That's
> a debug warning - if it hits even once, that's already "uhhuh,
> something is bad". Showing the warning more than once is likely just
> going to cause more problems, not give you more information.

Xiubo and Jeff used it to track down some issues between netfs library
and folio code that have been randomly plaguing our automated tests for
a couple of releases.  We already knew that there were issues in that
area and the actual occurrences mattered.  This was done in cooperation
with Willy and, since he was involved and this is a no-impact change,
I didn't think twice.

>
> And did somebody verify that d_same_name() is still inlined in the
> place that truly *matters*?  Because from my quick test, that patch
> broke it. Now __d_lookup() does a function call.
>
> And I _suspect_ it's all ok, because it turns out that
> __d_lookup_rcu() is the *really* hot case, and that one has inlined it
> all manually.
>
> But this kind of "we touch some *truly* core functionality, without
> the acks from the maintainers, and then we *claim* to have relevant
> acks" is really not even remotely ok.

I raised the lack of a formal Acked-by from Al on the dcache change
with Jeff a while ago and my understanding was that he reached out to
Al and got the ack (after some ghosting on Al's behalf).  I apologize
if I got that wrong -- all this happened in the middle of Jeff
transitioning his maintainership duties.

>
> I've pulled this because I suspect that d_same_name() thing is fine,
> and I think the VM_WARN_ON_FOLIO() addition is completely wrong but
> not horrendous.
>
> But you're on my tentative shit-list just for having claimed to have
> appropriate acks and having been found wanting.
>
> Just for your information: fs/dcache.c is some of the most optimized
> code in the kernel, and some of the subtlest. That RCU pathname lookup
> is serious business. You don't make changes to pathname lookup just
> willy nilly. There's a reason I start looking at individual patches
> when I see it in the diffstat.

Understood.

Thanks,

                Ilya
Jeff Layton Aug. 11, 2022, 9:08 p.m. UTC | #3
On Thu, 2022-08-11 at 22:55 +0200, Ilya Dryomov wrote:
> On Thu, Aug 11, 2022 at 10:04 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > 
> > On Thu, Aug 11, 2022 at 8:25 AM Ilya Dryomov <idryomov@gmail.com> wrote:
> > > 
> > >    [..] Several patches
> > > touch files outside of our normal purview to set the stage for bringing
> > > in Jeff's long awaited ceph+fscrypt series in the near future.  All of
> > > them have appropriate acks and sat in linux-next for a while.
> > 
> > What? No.
> > 
> > I'm looking at the fs/dcache.c change, for example, and don't see the
> > relevant maintainers having acked it at all.
> > 
> > The mmdebug.h file similarly seems to not have the actual maintainer
> > acks, and seems just plain stupid (why does it add that new folio
> > warning macro, when the existing folio warning macro
> > VM_WARN_ON_ONCE_FOLIO() is *better*?
> > 
> > Those are some very core files, and while the changes seem harmless,
> > they sure don't seem obviously ok.
> > 
> > What's the point of warning about bogus folios more than once? That's
> > a debug warning - if it hits even once, that's already "uhhuh,
> > something is bad". Showing the warning more than once is likely just
> > going to cause more problems, not give you more information.
> 
> Xiubo and Jeff used it to track down some issues between netfs library
> and folio code that have been randomly plaguing our automated tests for
> a couple of releases.  We already knew that there were issues in that
> area and the actual occurrences mattered.  This was done in cooperation
> with Willy and, since he was involved and this is a no-impact change,
> I didn't think twice.
> 
> > 
> > And did somebody verify that d_same_name() is still inlined in the
> > place that truly *matters*?  Because from my quick test, that patch
> > broke it. Now __d_lookup() does a function call.
> > 
> > And I _suspect_ it's all ok, because it turns out that
> > __d_lookup_rcu() is the *really* hot case, and that one has inlined it
> > all manually.
> > 
> > But this kind of "we touch some *truly* core functionality, without
> > the acks from the maintainers, and then we *claim* to have relevant
> > acks" is really not even remotely ok.
> 
> I raised the lack of a formal Acked-by from Al on the dcache change
> with Jeff a while ago and my understanding was that he reached out to
> Al and got the ack (after some ghosting on Al's behalf).  I apologize
> if I got that wrong -- all this happened in the middle of Jeff
> transitioning his maintainership duties.
> 

Actually, I never got a formal ack from Al. I did send it repeatedly,
but I assume he has been too busy to respond. We've had it sitting in
linux-next for a couple of months, and he did suggest that approach in
the first place, but I too would also prefer to see his official ack on
it.


> > 
> > I've pulled this because I suspect that d_same_name() thing is fine,
> > and I think the VM_WARN_ON_FOLIO() addition is completely wrong but
> > not horrendous.
> > 
> > But you're on my tentative shit-list just for having claimed to have
> > appropriate acks and having been found wanting.
> > 
> > Just for your information: fs/dcache.c is some of the most optimized
> > code in the kernel, and some of the subtlest. That RCU pathname lookup
> > is serious business. You don't make changes to pathname lookup just
> > willy nilly. There's a reason I start looking at individual patches
> > when I see it in the diffstat.
> 
> Understood.
> 
> Thanks,
> 
>                 Ilya
Al Viro Aug. 11, 2022, 9:22 p.m. UTC | #4
On Thu, Aug 11, 2022 at 05:08:11PM -0400, Jeff Layton wrote:

> Actually, I never got a formal ack from Al. I did send it repeatedly,
> but I assume he has been too busy to respond. We've had it sitting in
> linux-next for a couple of months, and he did suggest that approach in
> the first place, but I too would also prefer to see his official ack on
> it.

"Suggested approach" had been about inode_insert5() changes, right?
But that's fs/inode.c side of things...  I have to admit that I'd missed
the unlining d_same_name() - exporting the sucker per se didn't look
insane and I hadn't looked at that in details ;-/

Looking at it now...  might be worth renaming it into __d_same_name(),
leaving it inlined and exporting a wrapper; not sure if the impact on
d_lookup()/__d_lookup()/d_alloc_parallel() is worth worrying about it,
though.

Profiling a case when we have a plenty of files in the same directory
on tmpfs, with something earlier in the pathname to kick out of RCU
mode (e.g. going through /proc/self/cwd) might be interesting...
Al Viro Aug. 11, 2022, 9:23 p.m. UTC | #5
On Thu, Aug 11, 2022 at 10:22:37PM +0100, Al Viro wrote:
> On Thu, Aug 11, 2022 at 05:08:11PM -0400, Jeff Layton wrote:
> 
> > Actually, I never got a formal ack from Al. I did send it repeatedly,
> > but I assume he has been too busy to respond. We've had it sitting in
> > linux-next for a couple of months, and he did suggest that approach in
> > the first place, but I too would also prefer to see his official ack on
> > it.
> 
> "Suggested approach" had been about inode_insert5() changes, right?
> But that's fs/inode.c side of things...  I have to admit that I'd missed
> the unlining d_same_name() - exporting the sucker per se didn't look

s/un/&in/ and I really need to grab some coffee...
Linus Torvalds Aug. 11, 2022, 9:29 p.m. UTC | #6
On Thu, Aug 11, 2022 at 1:56 PM Ilya Dryomov <idryomov@gmail.com> wrote:
>
> On Thu, Aug 11, 2022 at 10:04 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > What's the point of warning about bogus folios more than once? That's
> > a debug warning - if it hits even once, that's already "uhhuh,
> > something is bad". Showing the warning more than once is likely just
> > going to cause more problems, not give you more information.
>
> Xiubo and Jeff used it to track down some issues between netfs library
> and folio code that have been randomly plaguing our automated tests for
> a couple of releases.  We already knew that there were issues in that
> area and the actual occurrences mattered.  This was done in cooperation
> with Willy and, since he was involved and this is a no-impact change,
> I didn't think twice.

I don't mind the warning.

I mind the "more than ONCE" part.

If it's a "this shouldn't happen, but if it ever happens I want to
know about it" situation, then the ONCE variant should be what you
want.

And that variant already existed, and adding a new and inferior macro
seems to just have been pointless.

As to dcache issues - if you really don't get an ack from Al, you can
at least make me aware of it before-hand. That's one of the files that
I at least personally care about, and while I would much prefer an ack
from Al for anything that touches it, at least I'll likely be less
unhappy about changes if I was made aware of them ahead of time,
instead of seeing a pull request that suddenly changes that file.

               Linus
Jeff Layton Aug. 11, 2022, 9:30 p.m. UTC | #7
On Thu, 2022-08-11 at 22:22 +0100, Al Viro wrote:
> On Thu, Aug 11, 2022 at 05:08:11PM -0400, Jeff Layton wrote:
> 
> > Actually, I never got a formal ack from Al. I did send it repeatedly,
> > but I assume he has been too busy to respond. We've had it sitting in
> > linux-next for a couple of months, and he did suggest that approach in
> > the first place, but I too would also prefer to see his official ack on
> > it.
> 
> "Suggested approach" had been about inode_insert5() changes, right?

Right. I was talking about this patch (which I think is sane):

    fs: change test in inode_insert5 for adding to the sb list

> But that's fs/inode.c side of things...  I have to admit that I'd missed
> the unlining d_same_name() - exporting the sucker per se didn't look
> insane and I hadn't looked at that in details ;-/
> 
> Looking at it now...  might be worth renaming it into __d_same_name(),
> leaving it inlined and exporting a wrapper; not sure if the impact on
> d_lookup()/__d_lookup()/d_alloc_parallel() is worth worrying about it,
> though.
> 
> Profiling a case when we have a plenty of files in the same directory
> on tmpfs, with something earlier in the pathname to kick out of RCU
> mode (e.g. going through /proc/self/cwd) might be interesting...

The d_name_name changes seemed ok to me, but it would be good to have
your ack (or qualified NAK) if possible.
pr-tracker-bot@kernel.org Aug. 11, 2022, 9:32 p.m. UTC | #8
The pull request you sent on Thu, 11 Aug 2022 17:24:46 +0200:

> https://github.com/ceph/ceph-client.git tags/ceph-for-5.20-rc1

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/786da5da5671c2d4cf812fe1ccc980bdde30c69e

Thank you!
Al Viro Aug. 11, 2022, 9:38 p.m. UTC | #9
On Thu, Aug 11, 2022 at 05:30:03PM -0400, Jeff Layton wrote:
> On Thu, 2022-08-11 at 22:22 +0100, Al Viro wrote:
> > On Thu, Aug 11, 2022 at 05:08:11PM -0400, Jeff Layton wrote:
> > 
> > > Actually, I never got a formal ack from Al. I did send it repeatedly,
> > > but I assume he has been too busy to respond. We've had it sitting in
> > > linux-next for a couple of months, and he did suggest that approach in
> > > the first place, but I too would also prefer to see his official ack on
> > > it.
> > 
> > "Suggested approach" had been about inode_insert5() changes, right?
> 
> Right. I was talking about this patch (which I think is sane):
> 
>     fs: change test in inode_insert5 for adding to the sb list

It is, AFAICS.

> > But that's fs/inode.c side of things...  I have to admit that I'd missed
> > the unlining d_same_name() - exporting the sucker per se didn't look
> > insane and I hadn't looked at that in details ;-/
> > 
> > Looking at it now...  might be worth renaming it into __d_same_name(),
> > leaving it inlined and exporting a wrapper; not sure if the impact on
> > d_lookup()/__d_lookup()/d_alloc_parallel() is worth worrying about it,
> > though.
> > 
> > Profiling a case when we have a plenty of files in the same directory
> > on tmpfs, with something earlier in the pathname to kick out of RCU
> > mode (e.g. going through /proc/self/cwd) might be interesting...
> 
> The d_name_name changes seemed ok to me, but it would be good to have
> your ack (or qualified NAK) if possible.

Exporting the functionality?  Sure, no problem.  Uninlining that one...
I suspect that it's OK, but I'd like to see profiling data; it's not
as if it would be hard to return to having it inlined, obviously.

Again, my apologies for not spotting that one.
Linus Torvalds Aug. 11, 2022, 9:52 p.m. UTC | #10
On Thu, Aug 11, 2022 at 2:38 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Exporting the functionality?  Sure, no problem.  Uninlining that one...
> I suspect that it's OK, but I'd like to see profiling data; it's not
> as if it would be hard to return to having it inlined, obviously.

The only case where I think it might matter is in __d_lookup(), and
it's probably not measurable.

Yes, __d_lookup() does matter, but it only matters once you've fallen
out of RCU mode, and at that point the cost of the function call is
likely in the noise.

I don't particularly like how it's inside that dentry hash chain loop,
but realistically by then we've already done a function call for the
dentry lock spinlock, so that loop already has to deal with it.

Again, __d_lookup_rcu() is the place where adding a function call
would matter more, because that one really does show up on profiles
regularly.

And it so carefully tries to avoid function calls (but the
DCACHE_OP_COMPARE case causes problems: at one time a few years ago I
actually wanted to move the DCACHE_OP_COMPARE *out* of the loop
entirely, because it's loop invariant and having that unlikely cause
inside the loop still causes bad things for register allocation).

So I think the uninlining is fine. If I had been really unhappy about
it I would have undone the pull.

It was more the "I was told there would be cake, but there was no
cake" that annoyed me.

            Linus
Al Viro Aug. 11, 2022, 10:04 p.m. UTC | #11
On Thu, Aug 11, 2022 at 02:52:26PM -0700, Linus Torvalds wrote:
> On Thu, Aug 11, 2022 at 2:38 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > Exporting the functionality?  Sure, no problem.  Uninlining that one...
> > I suspect that it's OK, but I'd like to see profiling data; it's not
> > as if it would be hard to return to having it inlined, obviously.
> 
> The only case where I think it might matter is in __d_lookup(), and
> it's probably not measurable.
> 
> Yes, __d_lookup() does matter, but it only matters once you've fallen
> out of RCU mode, and at that point the cost of the function call is
> likely in the noise.
> 
> I don't particularly like how it's inside that dentry hash chain loop,
> but realistically by then we've already done a function call for the
> dentry lock spinlock, so that loop already has to deal with it.

FWIW, I wonder if we should do
	if (READ_ONCE(dentry->d_parent) != parent)
		continue;
before grabbing ->d_lock (and repeat the check after grabbing it,
of course).  It's OK from correctness POV - we are OK with false
negatives from __d_lookup() if concurrent rename happens.  And
it just might be a sufficiently large performance win...
Linus Torvalds Aug. 11, 2022, 10:22 p.m. UTC | #12
On Thu, Aug 11, 2022 at 3:04 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> FWIW, I wonder if we should do
>         if (READ_ONCE(dentry->d_parent) != parent)
>                 continue;
> before grabbing ->d_lock (and repeat the check after grabbing it,

It kind of makes sense. We already do that d_name.hash check outside
of the lock, so we already have that "we might race with a rename"
situation.

That said, I do think __d_lookup_rcu() is the more important of the two.

Here's a recreation of that patch I mentioned where the OP_COMPARE is
moved out of the loop. Just for fun, look at how much better the code
generation is for the common case when you don't have the call messing
up the clobbered registers etc.

Entirely untested, and I might have messed something up, but I suspect
this is a much bigger deal than whether d_same_name() is inlined or
not in the non-RCU path.

              Linus
Linus Torvalds Aug. 11, 2022, 10:43 p.m. UTC | #13
On Thu, Aug 11, 2022 at 3:22 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Here's a recreation of that patch I mentioned where the OP_COMPARE is
> moved out of the loop. Just for fun, look at how much better the code
> generation is for the common case when you don't have the call messing
> up the clobbered registers etc.

Oh, sadly, clang does much worse here.

Gcc ends up being able to not have a stack frame at all for
__d_lookup_rcu() once that DCACHE_OP_COMPARE case has been moved out.
The gcc code really looks very nice.

Clang, not so much, and it still has spills and reloads.

The loop still ends up better with clang (since that test is no longer
in the loop), but the code generated doesn't go from "ugly to really
nice", it just goes from "ugly to still somewhat ugly".

                  Linus
Al Viro Aug. 11, 2022, 10:44 p.m. UTC | #14
On Thu, Aug 11, 2022 at 03:22:38PM -0700, Linus Torvalds wrote:
> On Thu, Aug 11, 2022 at 3:04 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > FWIW, I wonder if we should do
> >         if (READ_ONCE(dentry->d_parent) != parent)
> >                 continue;
> > before grabbing ->d_lock (and repeat the check after grabbing it,
> 
> It kind of makes sense. We already do that d_name.hash check outside
> of the lock, so we already have that "we might race with a rename"
> situation.
> 
> That said, I do think __d_lookup_rcu() is the more important of the two.
> 
> Here's a recreation of that patch I mentioned where the OP_COMPARE is
> moved out of the loop. Just for fun, look at how much better the code
> generation is for the common case when you don't have the call messing
> up the clobbered registers etc.
> 
> Entirely untested, and I might have messed something up, but I suspect
> this is a much bigger deal than whether d_same_name() is inlined or
> not in the non-RCU path.

Looks sane at the first pass, but right now I'm really half-asleep - 5 hours
of sleep tonight, and about the same yesterday ;-/

I'll try to grab at least an hour or two and reread it once I'm more or less
awake...
Linus Torvalds Aug. 12, 2022, 3:58 a.m. UTC | #15
On Thu, Aug 11, 2022 at 3:43 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Oh, sadly, clang does much worse here.
>
> Gcc ends up being able to not have a stack frame at all for
> __d_lookup_rcu() once that DCACHE_OP_COMPARE case has been moved out.
> The gcc code really looks very nice.
>
> Clang, not so much, and it still has spills and reloads.

I ended up looking at the clang code generation more than I probably
should have, because I found it so odd.

Our code is literally written to not need that many values, and it
should be easy to keep everything in registers.

It turns out that clang is trying much too hard to be clever in
dentry_string_cmp(). The code is literally written so that we keep the
count of remaining characters in 'tcount', and then at the end we can
generate a 'mask' from that to ignore the parts of the pathname that
are beyond the size.

So it creates the mask by just doing

        mask = bytemask_from_count(tcount);

and the bytemask_from_count() is a very straightforward "take the byte
mask, multiply by 8 to get the bitmask, and then you can use that to
shift bits around and you get the byte mask:

   #define bytemask_from_count(cnt)       (~(~0ul << (cnt)*8))

But clang seems to decide that this "multiply by 8" all is a very
costly operation, and seems to have figured out that since we always
do everyting one word at a time, so 'tcount' is always updated in
sizeof(long), and that means that at the end 'tcount' is guaranteed to
be the original 'tcount & 7' (on a 64-bit machine).

End result: the above mask can be pre-computed before entering the loop at all.

Which is absolutely true, but adds to register pressure, and means
that you pre-compute that mask whether you actually need to or not.

But hey, even that is fine, because you can actually move the mask
outside *both* of the loops (both the hash chain *and* the inner
"check each pathname" loop, because the initial value of 'tcount' is
very much a fixed value per function call.

The computation is pretty cheap, the bigger expense is that now you
have one more thing you need to keep track of.

But on its own, that would probably still be ok, because hey, we have
extra registers. But it does make the liveness of that extra value be
quite large.

But the extra registers are then used by the fact that we have that

                b = load_unaligned_zeropad(ct);

which *ALSO* has that incredibly expensive "multiply by 8" operation,
except now it's a different value that needs to be multiplied by 8,
namely the offset within an aligned long-word, which we use to fix up
any unaligned faults that take exceptions.

And since clang seems to - once again - think that multiplying by 8 is
INCREDIBLY EXPENSIVE, what it does is say "hey, I see all the places
where that base pointer is updated, and I can have my own internal
variable that tracks that value, except I do "update-by-8" there.

So now clang has created _another_ special internal variable that it
updates inside the inner loop, which tracks the bit offset of the
*aligned* part of the 'ct' variable, so that in the *extremely*
unlikely (read: never actually happens) situation where that
load_unaligned_zeropad takes an exception, it has the shift value
pre-computed.

And now clang runs out of registers, and honestly, it took me a
*loong* time to understand what the generated code was even trying to
do, because this was all so incredibly pointless and the code looked
so very very odd.

Anyway, it's amusing mostly because both of those "optimizations" that
clang did actually required some very clever compiler work.

It's just that they were both doing extra work in order to avoid an
operation that was _less_ work in the first place.

I suspect that there's some core clang optimization that goes
"addition is much cheaper than multiplication, and if we have an index
variable that is multiplied by a value, we can keep a shadow variable
of that index that is 'pre-multiplied', and thus avoid the
multiplication entirely".

It's absolute genius.

It just happens to generate really quite bad code, because multiplying
by 8 outside the loop is _so_ much cheaper than what clang ends up
generating.

On the whole, I think that any value that needs one or two ALU
operations to be calculated should *not* be seen as some prime example
of something that needs to be pre-computed in order to move it outside
the loop. The extra register pressure will make it a loss.

But admittedly we also have no way to tell clang that that exception
basically never happens, so maybe it thinks it is really important.

I'm adding clang people to the list, in case anybody wants to look at
it. The patch that started my path down this insanity is at

   https://lore.kernel.org/all/CAHk-=wjCa=Xf=pA2Z844WnwEeYgy9OPoB2kWphvg7PVn3ohScw@mail.gmail.com/

in case somebody gets an itch to look at a case where clang generates
some very odd code.

That said, this code has been streamlined so much that even clang
can't *really* make the end result slow. Just not as good as what gcc
does, because clang tries a bit too hard and bites off a bit more than
it can chew.

               Linus
Al Viro Aug. 14, 2022, 7:08 p.m. UTC | #16
On Thu, Aug 11, 2022 at 08:58:54PM -0700, Linus Torvalds wrote:
> On Thu, Aug 11, 2022 at 3:43 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Oh, sadly, clang does much worse here.
> >
> > Gcc ends up being able to not have a stack frame at all for
> > __d_lookup_rcu() once that DCACHE_OP_COMPARE case has been moved out.
> > The gcc code really looks very nice.
> >
> > Clang, not so much, and it still has spills and reloads.
> 
> I ended up looking at the clang code generation more than I probably
> should have, because I found it so odd.
> 
> Our code is literally written to not need that many values, and it
> should be easy to keep everything in registers.
> 
> It turns out that clang is trying much too hard to be clever in
> dentry_string_cmp(). The code is literally written so that we keep the
> count of remaining characters in 'tcount', and then at the end we can
> generate a 'mask' from that to ignore the parts of the pathname that
> are beyond the size.

[snip]

There's a cheap way to reduce the register pressure:
                seq = raw_seqcount_begin(&dentry->d_seq);
		if (dentry->d_parent != parent)
			continue;
		if (d_unhashed(dentry))
			continue;
		if (dentry->d_name.hash_len != hashlen)
			continue;
		if (dentry_cmp(dentry, str, hashlen_len(hashlen)) != 0)
			continue;  
		*seqp = seq;
could move the last store to before dentry_cmp().  Sure, we might get
some extra stores out of that.  Into a hot cacheline, and if we really
hit many of those extra stores, we already have a problem - a lot of
collisions both in ->d_parent and ->d_name.hash_len.  If that happens,
the cost of those extra stores is going to be trivial noise.

From correctness POV that should be fine; callers of __d_lookup_rcu()
getting NULL either entirely ignore *seqp (d_alloc_parallel()) or
proceed to wipe it out (lookup_fast(), by calling try_to_unlazy()).

Comments?
Linus Torvalds Aug. 14, 2022, 8:03 p.m. UTC | #17
On Sun, Aug 14, 2022 at 12:08 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
>
> There's a cheap way to reduce the register pressure:
>                 seq = raw_seqcount_begin(&dentry->d_seq);
>                 if (dentry->d_parent != parent)
>                         continue;
>                 if (d_unhashed(dentry))
>                         continue;
>                 if (dentry->d_name.hash_len != hashlen)
>                         continue;
>                 if (dentry_cmp(dentry, str, hashlen_len(hashlen)) != 0)
>                         continue;
>                 *seqp = seq;
> could move the last store to before dentry_cmp().

I actually tried that, it doesn't really end up helping.

Gcc does well regardless, and clang ends up really wanting to move so
much out of the dentry_cmp() loop that it runs out of registers and
always ends up doing a couple of spills.

I think it reduced the spills by one, but not enough to generate the
nice non-frame code that gcc does.

It's a bit ugly, but the code probably performs quite well - the
spills aren't in the inner loop.

             Linus
Linus Torvalds Aug. 14, 2022, 8:29 p.m. UTC | #18
On Sun, Aug 14, 2022 at 1:03 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Gcc does well regardless, and clang ends up really wanting to move so
> much out of the dentry_cmp() loop that it runs out of registers and
> always ends up doing a couple of spills.
>
> I think it reduced the spills by one, but not enough to generate the
> nice non-frame code that gcc does.

Note that that code was basically written to make gcc happy, so the
fact that clang then does worse is not hugely surprising.

I dug into it some more, and it is really "load_unaligned_zeropad()"
that makes clang really uncomfortable.

The problem ends up being that clang sees that it's inside that inner
loop, and tries very hard to optimize the shift-and-mask that happens
if the exception happens.

The fact that the exception *never* happens unless DEBUG_PAGEALLOC is
enabled - and very very seldom even then - is not something we can
really explain to clang.

So it thinks that code is really hot in the inner loop, and does all
kinds of silly things due to that.

Gcc, in contrast, generates the obvious straightforward code, and
that's what I wrote and optimized that code for.

             Linus
Nick Desaulniers Aug. 14, 2022, 9:20 p.m. UTC | #19
On Sun, Aug 14, 2022 at 1:29 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I dug into it some more, and it is really "load_unaligned_zeropad()"
> that makes clang really uncomfortable.
>
> The problem ends up being that clang sees that it's inside that inner
> loop, and tries very hard to optimize the shift-and-mask that happens
> if the exception happens.
>
> The fact that the exception *never* happens unless DEBUG_PAGEALLOC is
> enabled - and very very seldom even then - is not something we can
> really explain to clang.

Probably if we could express that the do_exception label in
load_unaligned_zeropad was cold then that might help.
https://github.com/llvm/llvm-project/issues/46831

Otherwise, could we put the exceptional case statements in a noinline
or cold attributed function?