
[v3,2/2] vfs: avoid duplicating creds in faccessat if possible

Message ID 20230125155557.37816-2-mjguzik@gmail.com (mailing list archive)
State Handled Elsewhere
Series [v3,1/2] capability: add cap_isidentical

Commit Message

Mateusz Guzik Jan. 25, 2023, 3:55 p.m. UTC
access(2) remains commonly used, for example on exec:
access("/etc/ld.so.preload", R_OK)

or when running gcc: strace -c gcc empty.c
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  0.00    0.000000           0        42        26 access

It falls through to do_faccessat without the AT_EACCESS flag, which in turn
results in allocation of new creds in order to modify fsuid/fsgid and
caps. This is a very expensive process, single-threaded and even more so
multi-threaded, with numerous structures getting refed and unrefed only
for the new creds to be destroyed moments later.

Turns out for typical consumers the resulting creds would be identical
and this can be checked upfront, avoiding the hard work.
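
Roughly, the added check amounts to the following (paraphrased sketch, not
the literal diff; the helper name is illustrative, cap_isidentical() is the
helper added by patch 1/2):

static bool faccessat_creds_would_change(const struct cred *cred)
{
	/* the override sets fsuid/fsgid back to the real uid/gid */
	if (!uid_eq(cred->fsuid, cred->uid) ||
	    !gid_eq(cred->fsgid, cred->gid))
		return true;

	/* it also recomputes the effective capability set */
	if (!issecure(SECURE_NO_SETUID_FIXUP)) {
		kuid_t root_uid = make_kuid(cred->user_ns, 0);

		if (uid_eq(cred->uid, root_uid))
			return !cap_isidentical(cred->cap_effective,
						cred->cap_permitted);
		return !cap_isclear(cred->cap_effective);
	}

	return false;
}

Only when a check along these lines returns true does do_faccessat() pay for
prepare_creds()/override_creds(); otherwise the current creds are used as-is.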

An access benchmark plugged into will-it-scale running on Cascade Lake
shows:
test	proc	before	after
access1	1	1310582	2908735	 (+121%)  # distinct files
access1	24	4716491	63822173 (+1353%) # distinct files
access2	24	2378041	5370335	 (+125%)  # same file

The above benchmarks are not integrated into will-it-scale, but can be
found in a pull request:
https://github.com/antonblanchard/will-it-scale/pull/36/files

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>

v3:
- add a comment warning about changing access_override_creds
v2:
- fix current->cred usage warn reported by the kernel test robot
Link: https://lore.kernel.org/all/202301150709.9EC6UKBT-lkp@intel.com/
---
 fs/open.c | 38 +++++++++++++++++++++++++++++++++++++-
 1 file changed, 37 insertions(+), 1 deletion(-)

Comments

Linus Torvalds Feb. 28, 2023, 12:44 a.m. UTC | #1
On Wed, Jan 25, 2023 at 7:56 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> Turns out for typical consumers the resulting creds would be identical
> and this can be checked upfront, avoiding the hard work.

I've applied this v3 of the two patches.

Normally it would go through Al, but he's clearly been under the
weather and is drowning in email. Besides, I'm comfortable with this
particular set of patches anyway as I was involved in the previous
round of access() overhead avoidance with the whole RCU grace period
thing.

So I think we're all better off having Al look at any iov_iter issues.

Anybody holler if there are issues,

             Linus
Christian Brauner March 2, 2023, 8:30 a.m. UTC | #2
On Mon, Feb 27, 2023 at 04:44:06PM -0800, Linus Torvalds wrote:
> On Wed, Jan 25, 2023 at 7:56 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
> >
> > Turns out for typical consumers the resulting creds would be identical
> > and this can be checked upfront, avoiding the hard work.
> 
> I've applied this v3 of the two patches.
> 
> Normally it would go through Al, but he's clearly been under the
> weather and is drowning in email. Besides, I'm comfortable with this
> particular set of patches anyway as I was involved in the previous
> round of access() overhead avoidance with the whole RCU grace period
> thing.
> 
> So I think we're all better off having Al look at any iov_iter issues.
> 
> Anybody holler if there are issues,

Fwiw, as long as you, Al, and others are fine with it and I'm aware of
it I'm happy to pick up more stuff like this. I've done it before and
have worked in this area so I'm happy to help with some of the load.

Christian
Linus Torvalds March 2, 2023, 5:51 p.m. UTC | #3
On Thu, Mar 2, 2023 at 12:30 AM Christian Brauner <brauner@kernel.org> wrote:
>
> Fwiw, as long as you, Al, and others are fine with it and I'm aware of
> it I'm happy to pick up more stuff like this. I've done it before and
> have worked in this area so I'm happy to help with some of the load.

Yeah, that would work. We've actually had discussions of vfs
maintenance in general.

In this case it really wasn't an issue, with this being just two
fairly straightforward patches for code that I was familiar with.

              Linus
Al Viro March 2, 2023, 6:11 p.m. UTC | #4
On Thu, Mar 02, 2023 at 09:30:25AM +0100, Christian Brauner wrote:
> On Mon, Feb 27, 2023 at 04:44:06PM -0800, Linus Torvalds wrote:
> > On Wed, Jan 25, 2023 at 7:56 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
> > >
> > > Turns out for typical consumers the resulting creds would be identical
> > > and this can be checked upfront, avoiding the hard work.
> > 
> > I've applied this v3 of the two patches.
> > 
> > Normally it would go through Al, but he's clearly been under the
> > weather and is drowning in email. Besides, I'm comfortable with this
> > particular set of patches anyway as I was involved in the previous
> > round of access() overhead avoidance with the whole RCU grace period
> > thing.
> > 
> > So I think we're all better off having Al look at any iov_iter issues.
> > 
> > Anybody holler if there are issues,
> 
> Fwiw, as long as you, Al, and others are fine with it and I'm aware of
> it I'm happy to pick up more stuff like this. I've done it before and
> have worked in this area so I'm happy to help with some of the load.

TBH, I've missed that series; patches look sane, so consider them
belatedly ACKed.

And I've no problem with sharing the load - you have been doing that with
idmapping stuff anyway.  As far as I'm concerned, I generally trust your
taste; that doesn't mean I won't disagree with specific decisions,
of course, but...
Mateusz Guzik March 2, 2023, 6:14 p.m. UTC | #5
On 3/2/23, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Thu, Mar 2, 2023 at 12:30 AM Christian Brauner <brauner@kernel.org>
> wrote:
>>
>> Fwiw, as long as you, Al, and others are fine with it and I'm aware of
>> it I'm happy to pick up more stuff like this. I've done it before and
>> have worked in this area so I'm happy to help with some of the load.
>
> Yeah, that would work. We've actually had discussions of vfs
> maintenance in general.
>
> In this case it really wasn't an issue, with this being just two
> fairly straightforward patches for code that I was familiar with.
>

Well on that note I intend to write a patch which would add a flag to
the dentry cache.

There is this thing named CONFIG_INIT_ON_ALLOC_DEFAULT_ON which is
enabled at least on debian, ubuntu and arch. It results in mandatory
memset on the obj before it gets returned from kmalloc, for hardening
purposes.

Now the problem is that dentry cache allocates bufs 4096 bytes in
size, so this is an equivalent of a clear_page call and it happens
*every time* there is a path lookup.

Given how dentry cache is used, I'm confident there is 0 hardening
benefit for it.

So the plan would be to add a flag on cache creation to exempt it from
the mandatory memset treatment and use it with dentry.

I don't have numbers handy but as you can imagine it gave me a nice bump :)

Whatever you think about the idea aside, the q is: can something like
the above go in without Al approving it?

If so, I'll try to find time to hack it up this or next week. I had
some other patches to write too, but $reallife.
Al Viro March 2, 2023, 6:18 p.m. UTC | #6
On Thu, Mar 02, 2023 at 07:14:24PM +0100, Mateusz Guzik wrote:
> On 3/2/23, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > On Thu, Mar 2, 2023 at 12:30 AM Christian Brauner <brauner@kernel.org>
> > wrote:
> >>
> >> Fwiw, as long as you, Al, and others are fine with it and I'm aware of
> >> it I'm happy to pick up more stuff like this. I've done it before and
> >> have worked in this area so I'm happy to help with some of the load.
> >
> > Yeah, that would work. We've actually had discussions of vfs
> > maintenance in general.
> >
> > In this case it really wasn't an issue, with this being just two
> > fairly straightforward patches for code that I was familiar with.
> >
> 
> Well on that note I intend to write a patch which would add a flag to
> the dentry cache.
> 
> There is this thing named CONFIG_INIT_ON_ALLOC_DEFAULT_ON which is
> enabled at least on debian, ubuntu and arch. It results in mandatory
> memset on the obj before it gets returned from kmalloc, for hardening
> purposes.
> 
> Now the problem is that dentry cache allocates bufs 4096 bytes in
> size, so this is an equivalent of a clear_page call and it happens
> *every time* there is a path lookup.

Huh?  Quite a few path lookups don't end up allocating any dentries;
what exactly are you talking about?

> Given how dentry cache is used, I'm confident there is 0 hardening
> benefit for it.
> 
> So the plan would be to add a flag on cache creation to exempt it from
> the mandatory memset treatment and use it with dentry.
> 
> I don't have numbers handy but as you can imagine it gave me a nice bump :)
> 
> Whatever you think about the idea aside, the q is: can something like
> the above go in without Al approving it?

That one I would really like to take a look at.
Mateusz Guzik March 2, 2023, 6:22 p.m. UTC | #7
On 3/2/23, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Thu, Mar 02, 2023 at 07:14:24PM +0100, Mateusz Guzik wrote:
>> On 3/2/23, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>> > On Thu, Mar 2, 2023 at 12:30 AM Christian Brauner <brauner@kernel.org>
>> > wrote:
>> >>
>> >> Fwiw, as long as you, Al, and others are fine with it and I'm aware of
>> >> it I'm happy to pick up more stuff like this. I've done it before and
>> >> have worked in this area so I'm happy to help with some of the load.
>> >
>> > Yeah, that would work. We've actually had discussions of vfs
>> > maintenance in general.
>> >
>> > In this case it really wasn't an issue, with this being just two
>> > fairly straightforward patches for code that I was familiar with.
>> >
>>
>> Well on that note I intend to write a patch which would add a flag to
>> the dentry cache.
>>
>> There is this thing named CONFIG_INIT_ON_ALLOC_DEFAULT_ON which is
>> enabled at least on debian, ubuntu and arch. It results in mandatory
>> memset on the obj before it gets returned from kmalloc, for hardening
>> purposes.
>>
>> Now the problem is that dentry cache allocates bufs 4096 bytes in
>> size, so this is an equivalent of a clear_page call and it happens
>> *every time* there is a path lookup.
>
> Huh?  Quite a few path lookups don't end up allocating any dentries;
> what exactly are you talking about?
>

Ops, I meant "names_cache", here:
	names_cachep = kmem_cache_create_usercopy("names_cache", PATH_MAX, 0,
			SLAB_HWCACHE_ALIGN|SLAB_PANIC, 0, PATH_MAX, NULL);

it is fs/dcache.c and I brainfarted into the above.

>> Given how dentry cache is used, I'm confident there is 0 hardening
>> benefit for it.
>>
>> So the plan would be to add a flag on cache creation to exempt it from
>> the mandatory memset treatment and use it with dentry.
>>
>> I don't have numbers handy but as you can imagine it gave me a nice bump
>> :)
>>
>> Whatever you think about the idea aside, the q is: can something like
>> the above go in without Al approving it?
>
> That one I would really like to take a look at.
>

Alright
Al Viro March 2, 2023, 6:41 p.m. UTC | #8
On Thu, Mar 02, 2023 at 06:18:32PM +0000, Al Viro wrote:
> On Thu, Mar 02, 2023 at 07:14:24PM +0100, Mateusz Guzik wrote:
> > On 3/2/23, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > > On Thu, Mar 2, 2023 at 12:30 AM Christian Brauner <brauner@kernel.org>
> > > wrote:
> > >>
> > >> Fwiw, as long as you, Al, and others are fine with it and I'm aware of
> > >> it I'm happy to pick up more stuff like this. I've done it before and
> > >> have worked in this area so I'm happy to help with some of the load.
> > >
> > > Yeah, that would work. We've actually had discussions of vfs
> > > maintenance in general.
> > >
> > > In this case it really wasn't an issue, with this being just two
> > > fairly straightforward patches for code that I was familiar with.
> > >
> > 
> > Well on that note I intend to write a patch which would add a flag to
> > the dentry cache.
> > 
> > There is this thing named CONFIG_INIT_ON_ALLOC_DEFAULT_ON which is
> > enabled at least on debian, ubuntu and arch. It results in mandatory
> > memset on the obj before it gets returned from kmalloc, for hardening
> > purposes.
> > 
> > Now the problem is that dentry cache allocates bufs 4096 bytes in
> > size, so this is an equivalent of a clear_page call and it happens
> > *every time* there is a path lookup.
> 
> Huh?  Quite a few path lookups don't end up allocating any dentries;
> what exactly are you talking about?
> 
> > Given how dentry cache is used, I'm confident there is 0 hardening
> > benefit for it.
> > 
> > So the plan would be to add a flag on cache creation to exempt it from
> > the mandatory memset treatment and use it with dentry.
> > 
> > I don't have numbers handy but as you can imagine it gave me a nice bump :)

Hmm...  Let's see what __d_alloc() will explicitly store into:
[*]	unsigned int d_flags;
[*]	seqcount_spinlock_t d_seq;
[*]	struct hlist_bl_node d_hash;
[*]	struct dentry *d_parent;
[*]	struct qstr d_name;
[*]	struct inode *d_inode;
	unsigned char d_iname[DNAME_INLINE_LEN];
[*]	struct lockref d_lockref;
[*]	const struct dentry_operations *d_op;
[*]	struct super_block *d_sb;
	unsigned long d_time;
[*]	void *d_fsdata;
	union {
[*]		struct list_head d_lru;
		wait_queue_head_t *d_wait;
	};
[*]	struct list_head d_child;
[*]	struct list_head d_subdirs;
	union {
		struct hlist_node d_alias;
		struct hlist_bl_node d_in_lookup_hash;
		struct rcu_head d_rcu;
	} d_u;

These are straightforward "overwrite no matter what".  ->d_time is not
initialized at all - it's fs-private, and unused for the majority
of filesystems (kernfs, nfs and vboxsf are the only users).
->d_alias/->d_in_lookup_hash/->d_rcu are also uninitialized - those
are valid only on some stages of dentry lifecycle (which is why
they can share space) and initialization is responsibility of
the places that do the corresponding state transitions.

->d_iname is the only somewhat delicate one.  I think we are OK as
it is (i.e. that having all of it zeroed out won't buy anything), but
that's not trivial.  Note that the last byte does need to be
initialized, despite the
        memcpy(dname, name->name, name->len);
        dname[name->len] = 0;
following it.
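
For reference, __d_alloc() takes care of that last byte with an explicit
store along these lines (paraphrased from fs/dcache.c):

	/*
	 * The inline name is kept NUL-terminated no matter what, so that
	 * the memcpy() done when names are switched over during rename
	 * still ends in a NUL even if it copies over an internal NUL.
	 */
	dentry->d_iname[DNAME_INLINE_LEN - 1] = 0;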

I'm not saying that getting rid of pre-zeroing the entire underlying
page is the wrong thing to do; it's just that it needs some analysis in
commit message.
Al Viro March 2, 2023, 6:43 p.m. UTC | #9
On Thu, Mar 02, 2023 at 07:22:17PM +0100, Mateusz Guzik wrote:

> Ops, I meant "names_cache", here:
> 	names_cachep = kmem_cache_create_usercopy("names_cache", PATH_MAX, 0,
> 			SLAB_HWCACHE_ALIGN|SLAB_PANIC, 0, PATH_MAX, NULL);
> 
> it is fs/dcache.c and I brainfarted into the above.

So you mean __getname() stuff?
Mateusz Guzik March 2, 2023, 6:51 p.m. UTC | #10
On 3/2/23, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Thu, Mar 02, 2023 at 07:22:17PM +0100, Mateusz Guzik wrote:
>
>> Ops, I meant "names_cache", here:
>> 	names_cachep = kmem_cache_create_usercopy("names_cache", PATH_MAX, 0,
>> 			SLAB_HWCACHE_ALIGN|SLAB_PANIC, 0, PATH_MAX, NULL);
>>
>> it is fs/dcache.c and I brainfarted into the above.
>
> So you mean __getname() stuff?
>

Yes. Do some lookups in a loop on a kernel built with
CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y (there may be a runtime switch for
it?) and you will see memset taking most of the time in perf top.
Al Viro March 2, 2023, 7:02 p.m. UTC | #11
On Thu, Mar 02, 2023 at 06:43:39PM +0000, Al Viro wrote:
> On Thu, Mar 02, 2023 at 07:22:17PM +0100, Mateusz Guzik wrote:
> 
> > Ops, I meant "names_cache", here:
> > 	names_cachep = kmem_cache_create_usercopy("names_cache", PATH_MAX, 0,
> > 			SLAB_HWCACHE_ALIGN|SLAB_PANIC, 0, PATH_MAX, NULL);
> > 
> > it is fs/dcache.c and I brainfarted into the above.
> 
> So you mean __getname() stuff?

The thing is, getname_flags()/getname_kernel() is not the only user of that
thing; grep and you'll see (and keep in mind that cifs alloc_dentry_path()
is a __getname() wrapper, with its own callers).  We might have bugs papered
over^W^Whardened away^W^Wpapered over in some of those users.

I agree that getname_flags()/getname_kernel()/sys_getcwd() have no need of
pre-zeroing; fw_get_filesystem_firmware(), ceph_mdsc_build_path(),
[hostfs]dentry_name() and ima_d_path() seem to be safe.  So's
vboxsf_path_from_dentry() (I think).  But with this bunch I'd need
a review before I'd be willing to say "this security theatre buys us
nothing here":

fs/cifs/cifsproto.h:67: return __getname();
fs/exfat/dir.c:195:     nb->lfn = __getname();
fs/fat/dir.c:287:               *unicode = __getname();
fs/fat/namei_vfat.c:598:        uname = __getname();
fs/ntfs3/dir.c:394:     name = __getname();
fs/ntfs3/inode.c:1289:  new_de = __getname();
fs/ntfs3/inode.c:1694:  de = __getname();
fs/ntfs3/inode.c:1732:  de = __getname();
fs/ntfs3/namei.c:71:    struct cpu_str *uni = __getname();
fs/ntfs3/namei.c:286:   de = __getname();
fs/ntfs3/namei.c:355:   struct cpu_str *uni = __getname();
fs/ntfs3/namei.c:494:   uni = __getname();
fs/ntfs3/namei.c:555:   uni1 = __getname();
fs/ntfs3/xattr.c:532:   buf = __getname();

fs/cifs/cifs_dfs_ref.c:168:     page = alloc_dentry_path();
fs/cifs/cifsacl.c:1697: page = alloc_dentry_path();
fs/cifs/cifsacl.c:1760: page = alloc_dentry_path();
fs/cifs/cifsproto.h:65:static inline void *alloc_dentry_path(void)
fs/cifs/dir.c:187:      void *page = alloc_dentry_path();
fs/cifs/dir.c:604:      page = alloc_dentry_path();
fs/cifs/dir.c:664:      page = alloc_dentry_path();
fs/cifs/file.c:594:     page = alloc_dentry_path();
fs/cifs/file.c:796:     page = alloc_dentry_path();
fs/cifs/file.c:2223:    void *page = alloc_dentry_path();
fs/cifs/file.c:2255:    void *page = alloc_dentry_path();
fs/cifs/inode.c:1663:   page = alloc_dentry_path();
fs/cifs/inode.c:1938:   page = alloc_dentry_path();
fs/cifs/inode.c:2001:   void *page = alloc_dentry_path();
fs/cifs/inode.c:2170:   page1 = alloc_dentry_path();
fs/cifs/inode.c:2171:   page2 = alloc_dentry_path();
fs/cifs/inode.c:2446:   page = alloc_dentry_path();
fs/cifs/inode.c:2738:   void *page = alloc_dentry_path();
fs/cifs/inode.c:2893:   void *page = alloc_dentry_path();
fs/cifs/ioctl.c:34:     void *page = alloc_dentry_path();
fs/cifs/link.c:491:     page1 = alloc_dentry_path();
fs/cifs/link.c:492:     page2 = alloc_dentry_path();
fs/cifs/misc.c:803:     page = alloc_dentry_path();
fs/cifs/readdir.c:1071: void *page = alloc_dentry_path();
fs/cifs/smb2ops.c:2059: void *page = alloc_dentry_path();
fs/cifs/xattr.c:112:    page = alloc_dentry_path();
fs/cifs/xattr.c:277:    page = alloc_dentry_path();
fs/cifs/xattr.c:382:    page = alloc_dentry_path();
Linus Torvalds March 2, 2023, 7:03 p.m. UTC | #12
On Thu, Mar 2, 2023 at 10:22 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> Ops, I meant "names_cache", here:
>         names_cachep = kmem_cache_create_usercopy("names_cache", PATH_MAX, 0,
>                         SLAB_HWCACHE_ALIGN|SLAB_PANIC, 0, PATH_MAX, NULL);

That code just needs a __GFP_SKIP_ZERO.

It might be best if we actually exposed it as a SLAB_SKIP_ZERO thing,
just to make it possible to say - exactly in situations like this -
that this particular slab cache has no advantage from pre-zeroing.

This doesn't sound like a vfs issue, this is a hardening issue where
apparently people now use that INIT_ON_ALLOC_DEFAULT_ON in "real use"
and then you notice how horrid the performance impact can be.

But there might also be some possible interactions with KASAN etc.

Adding some hardening people to the cc.

           Linus
Linus Torvalds March 2, 2023, 7:10 p.m. UTC | #13
On Thu, Mar 2, 2023 at 11:03 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> It might be best if we actually exposed it as a SLAB_SKIP_ZERO thing,
> just to make it possible to say - exactly in situations like this -
> that this particular slab cache has no advantage from pre-zeroing.

Actually, maybe it's just as well to keep it per-allocation, and just
special-case getname_flags() itself.

We could replace the __getname() there with just a

        kmem_cache_alloc(names_cachep, GFP_KERNEL | __GFP_SKIP_ZERO);

we're going to overwrite the beginning of the buffer with the path we
copy from user space, and then we'd have to make people comfortable
with the fact that even with zero initialization hardening on, the
space after the filename wouldn't be initialized...
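
For context, __getname() is just a thin macro over that slab cache
(include/linux/fs.h), so this would be a one-line change at the
getname_flags() allocation site; roughly (illustrative sketch, assuming
__GFP_SKIP_ZERO were made available for this):

/* existing definition in include/linux/fs.h */
#define __getname()		kmem_cache_alloc(names_cachep, GFP_KERNEL)

/* per-callsite opt-out inside getname_flags() instead of __getname() */
	result = kmem_cache_alloc(names_cachep, GFP_KERNEL | __GFP_SKIP_ZERO);
	if (unlikely(!result))
		return ERR_PTR(-ENOMEM);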

            Linus
Al Viro March 2, 2023, 7:18 p.m. UTC | #14
On Thu, Mar 02, 2023 at 07:02:10PM +0000, Al Viro wrote:
> On Thu, Mar 02, 2023 at 06:43:39PM +0000, Al Viro wrote:
> > On Thu, Mar 02, 2023 at 07:22:17PM +0100, Mateusz Guzik wrote:
> > 
> > > Ops, I meant "names_cache", here:
> > > 	names_cachep = kmem_cache_create_usercopy("names_cache", PATH_MAX, 0,
> > > 			SLAB_HWCACHE_ALIGN|SLAB_PANIC, 0, PATH_MAX, NULL);
> > > 
> > > it is fs/dcache.c and I brainfarted into the above.
> > 
> > So you mean __getname() stuff?
> 
> The thing is, getname_flags()/getname_kernel() is not the only user of that
> thing; grep and you'll see (and keep in mind that cifs alloc_dentry_path()
> is a __getname() wrapper, with its own callers).  We might have bugs papered
> over^W^Whardened away^W^Wpapered over in some of those users.
> 
> I agree that getname_flags()/getname_kernel()/sys_getcwd() have no need of
> pre-zeroing; fw_get_filesystem_firmware(), ceph_mdsc_build_path(),
> [hostfs]dentry_name() and ima_d_path() seem to be safe.  So's
> vboxsf_path_from_dentry() (I think).  But with this bunch I'd need
> a review before I'd be willing to say "this security theatre buys us
> nothing here":

[snip the list]

PS: ripping this bandaid off might very well be the right thing to do, it's just
that "I'm confident there is 0 hardening benefit for it" needs a code review
in some moderately grotty places.  It's not too awful (e.g. in case of cifs
most of the callers are immediately followed by build_path_from_dentry(), which
stores the pathname in the end of page and returns the pointer to beginning
of initialized part; verifying that after that allocation + build_path we
only access the parts past the returned pointer until it's time to free the
buffer is not hard), but it's worth doing.
Al Viro March 2, 2023, 7:19 p.m. UTC | #15
On Thu, Mar 02, 2023 at 11:10:03AM -0800, Linus Torvalds wrote:
> On Thu, Mar 2, 2023 at 11:03 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > It might be best if we actually exposed it as a SLAB_SKIP_ZERO thing,
> > just to make it possible to say - exactly in situations like this -
> > that this particular slab cache has no advantage from pre-zeroing.
> 
> Actually, maybe it's just as well to keep it per-allocation, and just
> special-case getname_flags() itself.
> 
> We could replace the __getname() there with just a
> 
>         kmem_cache_alloc(names_cachep, GFP_KERNEL | __GFP_SKIP_ZERO);
> 
> we're going to overwrite the beginning of the buffer with the path we
> copy from user space, and then we'd have to make people comfortable
> with the fact that even with zero initialization hardening on, the
> space after the filename wouldn't be initialized...

ACK; same in getname_kernel() and sys_getcwd(), at the very least.
Kees Cook March 2, 2023, 7:38 p.m. UTC | #16
On Thu, Mar 02, 2023 at 11:10:03AM -0800, Linus Torvalds wrote:
> On Thu, Mar 2, 2023 at 11:03 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > It might be best if we actually exposed it as a SLAB_SKIP_ZERO thing,
> > just to make it possible to say - exactly in situations like this -
> > that this particular slab cache has no advantage from pre-zeroing.
> 
> Actually, maybe it's just as well to keep it per-allocation, and just
> special-case getname_flags() itself.
> 
> We could replace the __getname() there with just a
> 
>         kmem_cache_alloc(names_cachep, GFP_KERNEL | __GFP_SKIP_ZERO);
> 
> we're going to overwrite the beginning of the buffer with the path we
> copy from user space, and then we'd have to make people comfortable
> with the fact that even with zero initialization hardening on, the
> space after the filename wouldn't be initialized...

Yeah, I'd love to have a way to safely opt-out of always-zero. The
discussion[1] when we originally did this devolved into a guessing
game on performance since no one could actually point to workloads
that were affected by it, beyond skbuff[2]. So in the interest of not
over-engineering a solution to an unknown problem, the plan was once
someone found a problem, we could find a sensible solution at that
time. And so here we are! :)

I'd always wanted to avoid a "don't zero" flag and instead adjust APIs so
the allocation could include a callback to do the memory content filling
that would return a size-that-was-initialized result. That way we don't
end up in the situations we've seen so many times with drivers, etc,
where an uninit buffer is handed off and some path fails to actually
fill it with anything. However, in practice, I think this kind of API
change becomes really hard to do.
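
To make that concrete, the kind of API meant here might look roughly like
this (entirely hypothetical sketch, names invented; nothing like it exists
today):

typedef size_t (*kmem_fill_fn)(void *buf, size_t size, void *ctx);

/* hypothetical: allocate without pre-zeroing, let the caller fill the
 * object, and zero only whatever the fill callback did not initialize */
static void *kmem_cache_alloc_filled(struct kmem_cache *s, gfp_t flags,
				     kmem_fill_fn fill, void *ctx)
{
	size_t size = kmem_cache_size(s);
	void *obj = kmem_cache_alloc(s, flags | __GFP_SKIP_ZERO);
	size_t used;

	if (!obj)
		return NULL;

	used = fill(obj, size, ctx);
	if (want_init_on_alloc(flags) && used < size)
		memset(obj + used, 0, size - used);
	return obj;
}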

-Kees

[1] https://lore.kernel.org/all/20190514143537.10435-4-glider@google.com/
[2] https://lore.kernel.org/all/20190514143537.10435-5-glider@google.com/
Eric Biggers March 2, 2023, 7:48 p.m. UTC | #17
On Thu, Mar 02, 2023 at 11:38:50AM -0800, Kees Cook wrote:
> On Thu, Mar 02, 2023 at 11:10:03AM -0800, Linus Torvalds wrote:
> > On Thu, Mar 2, 2023 at 11:03 AM Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > > It might be best if we actually exposed it as a SLAB_SKIP_ZERO thing,
> > > just to make it possible to say - exactly in situations like this -
> > > that this particular slab cache has no advantage from pre-zeroing.
> > 
> > Actually, maybe it's just as well to keep it per-allocation, and just
> > special-case getname_flags() itself.
> > 
> > We could replace the __getname() there with just a
> > 
> >         kmem_cache_alloc(names_cachep, GFP_KERNEL | __GFP_SKIP_ZERO);
> > 
> > we're going to overwrite the beginning of the buffer with the path we
> > copy from user space, and then we'd have to make people comfortable
> > with the fact that even with zero initialization hardening on, the
> > space after the filename wouldn't be initialized...
> 
> Yeah, I'd love to have a way to safely opt-out of always-zero. The
> discussion[1] when we originally did this devolved into a guessing
> game on performance since no one could actually point to workloads
> that were affected by it, beyond skbuff[2]. So in the interest of not
> over-engineering a solution to an unknown problem, the plan was once
> someone found a problem, we could find a sensible solution at that
> time. And so here we are! :)
> 
> I'd always wanted to avoid a "don't zero" flag and instead adjust APIs so
> the allocation could include a callback to do the memory content filling
> that would return a size-that-was-initialized result. That way we don't
> end up in the situations we've seen so many times with drivers, etc,
> where an uninit buffer is handed off and some path fails to actually
> fill it with anything. However, in practice, I think this kind of API
> change becomes really hard to do.
> 

Having not been following init_on_alloc very closely myself, I'm a bit surprised
that an opt-out flag never made it into the final version.

Was names_cachep considered in those earlier discussions?  I think that's a
pretty obvious use case for an opt-out.  Every syscall that operates on a path
allocates a 4K buffer from names_cachep.

- Eric
Kees Cook March 2, 2023, 7:54 p.m. UTC | #18
On Thu, Mar 02, 2023 at 07:19:49PM +0000, Al Viro wrote:
> On Thu, Mar 02, 2023 at 11:10:03AM -0800, Linus Torvalds wrote:
> > On Thu, Mar 2, 2023 at 11:03 AM Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > > It might be best if we actually exposed it as a SLAB_SKIP_ZERO thing,
> > > just to make it possible to say - exactly in situations like this -
> > > that this particular slab cache has no advantage from pre-zeroing.
> > 
> > Actually, maybe it's just as well to keep it per-allocation, and just
> > special-case getname_flags() itself.
> > 
> > We could replace the __getname() there with just a
> > 
> >         kmem_cache_alloc(names_cachep, GFP_KERNEL | __GFP_SKIP_ZERO);
> > 
> > we're going to overwrite the beginning of the buffer with the path we
> > copy from user space, and then we'd have to make people comfortable
> > with the fact that even with zero initialization hardening on, the
> > space after the filename wouldn't be initialized...
> 
> ACK; same in getname_kernel() and sys_getcwd(), at the very least.

FWIW, much earlier analysis suggested opting out these kmem caches:

	buffer_head
	names_cache
	mm_struct
	anon_vma
	skbuff_head_cache
	skbuff_fclone_cache

Alexander's more recent analysis[2] of skbuff went a bit further,
I think, and allowed opt-out for non-kmem-cache page allocations too.

-Kees

[2] https://lore.kernel.org/all/20190514143537.10435-5-glider@google.com/
Al Viro March 2, 2023, 8:11 p.m. UTC | #19
On Thu, Mar 02, 2023 at 11:54:03AM -0800, Kees Cook wrote:
> On Thu, Mar 02, 2023 at 07:19:49PM +0000, Al Viro wrote:
> > On Thu, Mar 02, 2023 at 11:10:03AM -0800, Linus Torvalds wrote:
> > > On Thu, Mar 2, 2023 at 11:03 AM Linus Torvalds
> > > <torvalds@linux-foundation.org> wrote:
> > > >
> > > > It might be best if we actually exposed it as a SLAB_SKIP_ZERO thing,
> > > > just to make it possible to say - exactly in situations like this -
> > > > that this particular slab cache has no advantage from pre-zeroing.
> > > 
> > > Actually, maybe it's just as well to keep it per-allocation, and just
> > > special-case getname_flags() itself.
> > > 
> > > We could replace the __getname() there with just a
> > > 
> > >         kmem_cache_alloc(names_cachep, GFP_KERNEL | __GFP_SKIP_ZERO);
> > > 
> > > we're going to overwrite the beginning of the buffer with the path we
> > > copy from user space, and then we'd have to make people comfortable
> > > with the fact that even with zero initialization hardening on, the
> > > space after the filename wouldn't be initialized...
> > 
> > ACK; same in getname_kernel() and sys_getcwd(), at the very least.
> 
> FWIW, much earlier analysis suggested opting out these kmem caches:
> 
> 	buffer_head
> 	names_cache
> 	mm_struct
> 	anon_vma
> 	skbuff_head_cache
> 	skbuff_fclone_cache

I would probably add dentry_cache to it; the only subtle part is
->d_iname and I'm convinced that explicit "make sure there's a NUL
at the very end" is enough.
Christian Brauner March 3, 2023, 2:27 p.m. UTC | #20
On Thu, Mar 02, 2023 at 06:11:07PM +0000, Al Viro wrote:
> On Thu, Mar 02, 2023 at 09:30:25AM +0100, Christian Brauner wrote:
> > On Mon, Feb 27, 2023 at 04:44:06PM -0800, Linus Torvalds wrote:
> > > On Wed, Jan 25, 2023 at 7:56 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
> > > >
> > > > Turns out for typical consumers the resulting creds would be identical
> > > > and this can be checked upfront, avoiding the hard work.
> > > 
> > > I've applied this v3 of the two patches.
> > > 
> > > Normally it would go through Al, but he's clearly been under the
> > > weather and is drowning in email. Besides, I'm comfortable with this
> > > particular set of patches anyway as I was involved in the previous
> > > round of access() overhead avoidance with the whole RCU grace period
> > > thing.
> > > 
> > > So I think we're all better off having Al look at any iov_iter issues.
> > > 
> > > Anybody holler if there are issues,
> > 
> > Fwiw, as long as you, Al, and others are fine with it and I'm aware of
> > it I'm happy to pick up more stuff like this. I've done it before and
> > have worked in this area so I'm happy to help with some of the load.
> 
> TBH, I've missed that series; patches look sane, so consider them
> belatedly ACKed.
> 
> And I've no problem with sharing the load - you have been doing that with
> idmapping stuff anyway.  As far as I'm concerned, I generally trust your
> taste; it doesn't matter that I won't disagree with specific decisions,
> of course, but...

Thanks, I appreciate that!

Sure, you won't agree with everything. That's kinda expected for reasons
of preference alone but also simply because there's a lot of stuff that
probably only you know atm. In general, I never pick up any stuff I
can't review myself unless it's in-your-face obvious or it already has
acks from the subject matter experts.

But it is much easier to help maintain vfs stuff now that you made it
clear that you're ok with this. So I'm happy to share more of the vfs
maintenance workload.
Christian Brauner March 3, 2023, 2:49 p.m. UTC | #21
On Thu, Mar 02, 2023 at 09:51:48AM -0800, Linus Torvalds wrote:
> On Thu, Mar 2, 2023 at 12:30 AM Christian Brauner <brauner@kernel.org> wrote:
> >
> > Fwiw, as long as you, Al, and others are fine with it and I'm aware of
> > it I'm happy to pick up more stuff like this. I've done it before and
> > have worked in this area so I'm happy to help with some of the load.
> 
> Yeah, that would work. We've actually had discussions of vfs
> maintenance in general.

As I've said to Al in the other thread I'm happy to help more.

> 
> In this case it really wasn't an issue, with this being just two
> fairly straightforward patches for code that I was familiar with.

Yup, I understand.
Alexander Potapenko March 3, 2023, 3:30 p.m. UTC | #22
On Thu, Mar 2, 2023 at 9:11 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Thu, Mar 02, 2023 at 11:54:03AM -0800, Kees Cook wrote:
> > On Thu, Mar 02, 2023 at 07:19:49PM +0000, Al Viro wrote:
> > > On Thu, Mar 02, 2023 at 11:10:03AM -0800, Linus Torvalds wrote:
> > > > On Thu, Mar 2, 2023 at 11:03 AM Linus Torvalds
> > > > <torvalds@linux-foundation.org> wrote:
> > > > >
> > > > > It might be best if we actually exposed it as a SLAB_SKIP_ZERO thing,
> > > > > just to make it possible to say - exactly in situations like this -
> > > > > that this particular slab cache has no advantage from pre-zeroing.
> > > >
> > > > Actually, maybe it's just as well to keep it per-allocation, and just
> > > > special-case getname_flags() itself.
> > > >
> > > > We could replace the __getname() there with just a
> > > >
> > > >         kmem_cache_alloc(names_cachep, GFP_KERNEL | __GFP_SKIP_ZERO);
> > > >
> > > > we're going to overwrite the beginning of the buffer with the path we
> > > > copy from user space, and then we'd have to make people comfortable
> > > > with the fact that even with zero initialization hardening on, the
> > > > space after the filename wouldn't be initialized...
> > >
> > > ACK; same in getname_kernel() and sys_getcwd(), at the very least.
> >
> > FWIW, much earlier analysis suggested opting out these kmem caches:
> >
> >       buffer_head
> >       names_cache
> >       mm_struct
> >       anon_vma
> >       skbuff_head_cache
> >       skbuff_fclone_cache
>
> I would probably add dentry_cache to it; the only subtle part is
> ->d_iname and I'm convinced that explicit "make sure there's a NUL
> at the very end" is enough.

FWIW, a couple of years ago I was looking into implementing the
following scheme for opt-out that I also discussed with Kees:

1. Add a ___GFP_SKIP_ZERO flag that is not intended to be used
explicitly (e.g. disallow passing it to kmalloc(), add a checkpatch.pl
warning). Explicitly passing an opt-out flag to allocation functions
was considered harmful previously:
https://lore.kernel.org/kernel-hardening/20190423083148.GF25106@dhcp22.suse.cz/

2. Define new allocation API that will allow opt-out:

  struct page *alloc_pages_uninit(gfp_t gfp, unsigned int order, const
char *key);
  void *kmalloc_uninit(size_t size, gfp_t flags, const char *key);
  void *kmem_cache_alloc_uninit(struct kmem_cache *, gfp_t flags,
const char *key);

, where @key is an arbitrary string that identifies a single
allocation or a group of allocations.

3. Provide a boot-time flag that holds a comma-separated list of
opt-out keys that actually take effect:

  init_on_alloc.skip="xyz-camera-driver,some_big_buffer".

The rationale behind this two-step mechanism is that even though certain
allocations may be appealing opt-out targets for performance reasons,
some vendors may choose to be on the safe side and explicitly list the
allocations that should not be zeroed.

Several possible enhancements include:
1. A SLAB_NOINIT memcache flag that prohibits cache merging and
disables initialization. Because the number of caches is relatively
small, it might be fine to explicitly pass SLAB_NOINIT at cache
creation sites.
Again, if needed, we could only use this flag as a hint that needs to
be acknowledged by a boot-time option.
2. No opt-out for kmalloc() - instead advise users to promote the
anonymous allocations they want to opt-out to memory caches with
SLAB_NOINIT.
3. Provide an emergency brake that completely disables
___GFP_SKIP_ZERO if a major security breach is discovered.

Extending this idea of per-callsite opt-out we could generate unique
keys for every allocation in the kernel (or e.g. group them together
by the caller name) and decide at runtime if we want to opt them out.
But I am not sure anyone would want this level of control in their distro.
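
A minimal sketch of how the proposal above would look at the two kinds of
sites (both the functions/flags and the boot parameter are proposals from
this mail, not existing API):

	/* per-callsite opt-out, only honored if the key is listed in
	 * init_on_alloc.skip= on the kernel command line */
	buf = kmem_cache_alloc_uninit(names_cachep, GFP_KERNEL, "names-cache");

	/* or the whole-cache variant via a SLAB_NOINIT flag at creation time */
	names_cachep = kmem_cache_create_usercopy("names_cache", PATH_MAX, 0,
			SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_NOINIT,
			0, PATH_MAX, NULL);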
Mateusz Guzik March 3, 2023, 5:39 p.m. UTC | #23
On 3/3/23, Alexander Potapenko <glider@google.com> wrote:
> On Thu, Mar 2, 2023 at 9:11 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>>
>> On Thu, Mar 02, 2023 at 11:54:03AM -0800, Kees Cook wrote:
>> > On Thu, Mar 02, 2023 at 07:19:49PM +0000, Al Viro wrote:
>> > > On Thu, Mar 02, 2023 at 11:10:03AM -0800, Linus Torvalds wrote:
>> > > > On Thu, Mar 2, 2023 at 11:03 AM Linus Torvalds
>> > > > <torvalds@linux-foundation.org> wrote:
>> > > > >
>> > > > > It might be best if we actually exposed it as a SLAB_SKIP_ZERO
>> > > > > thing,
>> > > > > just to make it possible to say - exactly in situations like this
>> > > > > -
>> > > > > that this particular slab cache has no advantage from
>> > > > > pre-zeroing.
>> > > >
>> > > > Actually, maybe it's just as well to keep it per-allocation, and
>> > > > just
>> > > > special-case getname_flags() itself.
>> > > >
>> > > > We could replace the __getname() there with just a
>> > > >
>> > > >         kmem_cache_alloc(names_cachep, GFP_KERNEL |
>> > > > __GFP_SKIP_ZERO);
>> > > >
>> > > > we're going to overwrite the beginning of the buffer with the path
>> > > > we
>> > > > copy from user space, and then we'd have to make people comfortable
>> > > > with the fact that even with zero initialization hardening on, the
>> > > > space after the filename wouldn't be initialized...
>> > >
>> > > ACK; same in getname_kernel() and sys_getcwd(), at the very least.
>> >
>> > FWIW, much earlier analysis suggested opting out these kmem caches:
>> >
>> >       buffer_head
>> >       names_cache
>> >       mm_struct
>> >       anon_vma
>> >       skbuff_head_cache
>> >       skbuff_fclone_cache
>>
>> I would probably add dentry_cache to it; the only subtle part is
>> ->d_iname and I'm convinced that explicit "make sure there's a NUL
>> at the very end" is enough.
>
> FWIW, a couple of years ago I was looking into implementing the
> following scheme for opt-out that I also discussed with Kees:
>
> 1. Add a ___GFP_SKIP_ZERO flag that is not intended to be used
> explicitly (e.g. disallow passing it to kmalloc(), add a checkpatch.pl
> warning). Explicitly passing an opt-out flag to allocation functions
> was considered harmful previously:
> https://lore.kernel.org/kernel-hardening/20190423083148.GF25106@dhcp22.suse.cz/
>
> 2. Define new allocation API that will allow opt-out:
>
>   struct page *alloc_pages_uninit(gfp_t gfp, unsigned int order, const
> char *key);
>   void *kmalloc_uninit(size_t size, gfp_t flags, const char *key);
>   void *kmem_cache_alloc_uninit(struct kmem_cache *, gfp_t flags,
> const char *key);
>
> , where @key is an arbitrary string that identifies a single
> allocation or a group of allocations.
>
> 3. Provide a boot-time flag that holds a comma-separated list of
> opt-out keys that actually take effect:
>
>   init_on_alloc.skip="xyz-camera-driver,some_big_buffer".
>
> The rationale behind this two-step mechanism is that despite certain
> allocations may be appealing opt-out targets for performance reasons,
> some vendors may choose to be on the safe side and explicitly list the
> allocations that should not be zeroed.
>
> Several possible enhancements include:
> 1. A SLAB_NOINIT memcache flag that prohibits cache merging and
> disables initialization. Because the number of caches is relatively
> small, it might be fine to explicitly pass SLAB_NOINIT at cache
> creation sites.
> Again, if needed, we could only use this flag as a hint that needs to
> be acknowledged by a boot-time option.
> 2. No opt-out for kmalloc() - instead advise users to promote the
> anonymous allocations they want to opt-out to memory caches with
> SLAB_NOINIT.
> 3. Provide an emergency brake that completely disables
> ___GFP_SKIP_ZERO if a major security breach is discovered.
>
> Extending this idea of per-callsite opt-out we could generate unique
> keys for every allocation in the kernel (or e.g. group them together
> by the caller name) and decide at runtime if we want to opt them out.
> But I am not sure anyone would want this level of control in their distro.
>

I intended to write a longer e-mail concerning the entire
init-on-alloc situation along with some patches in the not-so-distant
future, but the bare minimum I wrote previously already got numerous
people involved (unsurprisingly now that I look at it). I don't have
time to write the thing I wanted at the moment, but now that there is
traffic I think I should at least mention one other important bit.

I'm not going to argue for any particular level of granularity -- I'm
a happy camper as long as I can end up with names_cachep allocations
excluded without disabling the entire thing.

So the key is: memset is underperforming at least on x86-64 for
certain sizes and the init-on-alloc thingy makes it used significantly
more, exacerbating the problem. Fixing it is worthwhile on its own and
will reduce the impact of the option, but the need for some form of
opt-out will remain.

I don't have any numbers handy nor time to produce them, so the mail
will be a little handwave-y. I only write it in case someone decides
to bench now what would warrant exclusion with the mechanism under
discussion. Should any of the claims below be challenged, I can
respond some time late next week with hard data(tm).

Onto the issue:
Most cpus in use today have the ERMS bit, in which case the routine is:

SYM_FUNC_START_LOCAL(memset_erms)
        movq %rdi,%r9
        movb %sil,%al
        movq %rdx,%rcx
        rep stosb
        movq %r9,%rax
        RET
SYM_FUNC_END(memset_erms)

The problem here is that instructions with the rep prefix have a very
high startup latency. AFAIR this remains true even on cpus with FSRM
in the case of rep *stos* (what is helped by FSRM is rep *mov*, whereas
stos remains unaffected).

Interestingly, it looks like the problem was recognized in general --
memcpy and copy_{to,from}_user have hand-rolled smaller copies. Apart
from that, __clear_user relatively recently got a treatment of that
sort, but it somehow never got implemented in memset itself.

If memory serves right, the least slow way to do it is to *NOT* use
rep stos below 128 bytes of size (and this number is even higher the
better the cpu). Instead, a 32-byte loop (as in 4 plain movq stores) would
do it as long as there is space, along with overlapping stores to take
care of whatever is left below 32 bytes.
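
A rough C rendition of that scheme, purely to illustrate the shape (the
real routine would be assembly, and the exact thresholds are as described
above):

/* illustrative only: word-at-a-time stores below the rep-stos threshold */
static void *small_memset(void *dst, int c, size_t len)
{
	unsigned char *p = dst;
	unsigned long v = 0x0101010101010101UL * (unsigned char)c;

	while (len >= 32) {
		memcpy(p +  0, &v, 8);	/* each 8-byte memcpy is a single store */
		memcpy(p +  8, &v, 8);
		memcpy(p + 16, &v, 8);
		memcpy(p + 24, &v, 8);
		p += 32;
		len -= 32;
	}
	if (len >= 8) {
		/* overlapping stores: cover the head and the very end... */
		memcpy(p, &v, 8);
		memcpy(p + len - 8, &v, 8);
		/* ...and the middle, when 16 < len < 32 */
		if (len > 16) {
			memcpy(p + 8, &v, 8);
			memcpy(p + len - 16, &v, 8);
		}
	} else {
		while (len--)
			*p++ = (unsigned char)c;
	}
	return dst;
}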

__clear_user got rep stos if FSRM is present and a 64-byte non-rep
treatment, with an 8-byte loop and a 1-byte loop to cover the tail. As
noted above, this leaves perf on the table. Even then, it would be an
improvement for memset if transplanted over. Maybe this was mentioned
in the discussion concerning that function; I had not read the thread -- I
only see that memset remains unpatched.

memset, even with init-on-alloc disabled, is used *a lot* with very
small sizes. For that bit I do have data collected over 'make' in the
kernel directory.

bpftrace -e 'kprobe:memset { @ = lhist(arg2, 0, 128, 8); }'

[0, 8)           9126030 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@          |
[8, 16)           515728 |@@                                                  |
[16, 24)          621902 |@@                                                  |
[24, 32)          110822 |                                                    |
[32, 40)        11003451 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@  |
[40, 48)             488 |                                                    |
[48, 56)             164 |                                                    |
[56, 64)         1493851 |@@@@@@                                              |
[64, 72)          214613 |                                                    |
[72, 80)           10468 |                                                    |
[80, 88)           10524 |                                                    |
[88, 96)             121 |                                                    |
[96, 104)          81591 |                                                    |
[104, 112)       1659699 |@@@@@@@                                             |
[112, 120)          3240 |                                                    |
[120, 128)          9058 |                                                    |
[128, ...)      11287204 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

Note: I failed to figure out how to attach to memset on a stock kernel,
thus the kernel was booted with the crappery below:

diff --git a/arch/x86/lib/memset_64.S b/arch/x86/lib/memset_64.S
index 6143b1a6fa2c..141d899d5f1d 100644
--- a/arch/x86/lib/memset_64.S
+++ b/arch/x86/lib/memset_64.S
@@ -45,9 +45,6 @@ SYM_FUNC_START(__memset)
 SYM_FUNC_END(__memset)
 EXPORT_SYMBOL(__memset)

-SYM_FUNC_ALIAS(memset, __memset)
-EXPORT_SYMBOL(memset)
-
 /*
  * ISO C memset - set a memory block to a byte value. This function uses
  * enhanced rep stosb to override the fast string function.
diff --git a/fs/open.c b/fs/open.c
index 4401a73d4032..6e11e95ad9c3 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1564,3 +1564,11 @@ int stream_open(struct inode *inode, struct file *filp)
 }

 EXPORT_SYMBOL(stream_open);
+
+void *(memset)(void *s, int c, size_t n)
+{
+       return __memset(s, c, n);
+}
+
+EXPORT_SYMBOL(memset);
Linus Torvalds March 3, 2023, 5:54 p.m. UTC | #24
On Fri, Mar 3, 2023 at 9:39 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> So the key is: memset is underperforming at least on x86-64 for
> certain sizes and the init-on-alloc thingy makes it used significantly
> more, exacerbating the problem

One reason that the kernel memset isn't as optimized as memcpy, is
simply because under normal circumstances it shouldn't be *used* that
much outside of page clearing and constant-sized structure
initialization.

Page clearing is fine, and constant-sized structure inits are also
generally fine (ie the compiler does the small ones directly).

So this is literally a problem with pointless automated memset,
introduced by that hardening option. And hardening people generally
simply don't care about performance, and the people who _do_ care
about performance usually don't enable the known-expensive crazy
stuff.

Honestly, I think the INIT_ONCE stuff is actively detrimental, and
only hides issues (and in this case adds its own). So I can't but help
to say "don't do that then". I think it's literally stupid to clear
allocations by default.

I'm not opposed to improving memset, but honestly, if the argument is
based on the stupid hardening behavior, I really think *that* needs to
be fixed first.

               Linus
Mateusz Guzik March 3, 2023, 7:37 p.m. UTC | #25
On 3/3/23, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Fri, Mar 3, 2023 at 9:39 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
>>
>> So the key is: memset is underperforming at least on x86-64 for
>> certain sizes and the init-on-alloc thingy makes it used significantly
>> more, exacerbating the problem
>
> One reason that the kernel memset isn't as optimized as memcpy, is
> simply because under normal circumstances it shouldn't be *used* that
> much outside of page clearing and constant-sized structure
> initialization.
>

I mentioned in the previous e-mail that memset is used a lot even
without the problematic opt and even showed the size distribution of
what's getting passed there.

You made me curious how usage compares to memcpy, so I checked 'make'
in kernel dir once again.

I stress the init-on-alloc opt is *OFF*:
# CONFIG_INIT_ON_ALLOC_DEFAULT_ON is not set
# CONFIG_INIT_ON_FREE_DEFAULT_ON is not set

# bpftrace -e 'kprobe:memset,kprobe:memcpy { @[probe] = count(); }'
[snip]
@[kprobe:memcpy]: 6948498
@[kprobe:memset]: 36150671

iow memset is used about 7 times as much as memcpy when building the
kernel. If it matters I'm building on top of
2eb29d59ddf02e39774abfb60b2030b0b7e27c1f (reasonably fresh master).

> So this is literally a problem with pointless automated memset,
> introduced by that hardening option. And hardening people generally
> simply don't care about performance, and the people who _do _care
> about performance usually don't enable the known-expensive crazy
> stuff.
>
> Honestly, I think the INIT_ONCE stuff is actively detrimental, and
> only hides issues (and in this case adds its own). So I can't but help
> to say "don't do that then". I think it's literally stupid to clear
> allocations by default.
>

Questioning the usability of the entire thing was on the menu in what I
intended to write, but I did not read the entire public discussion, so
perhaps I missed a crucial insight.

> I'm not opposed to improving memset, but honestly, if the argument is
> based on the stupid hardening behavior, I really think *that* needs to
> be fixed first.
>

It is not. The argument is that this is a core primitive, used a lot
with sizes where rep stos is detrimental to its performance. There is
no urgency, but eventually someone(tm) should sort it out. For
$reasons I don't want to do it myself.

I do bring it up in the context of the init-on-alloc machinery because
it disfigures whatever results are obtained testing on x86-64 -- they
can get exactly the same effect for much smaller cost, consequently
they should have interest in sorting this out.

General point though was that the work should have sanity-checked
performance of the primitive it relies on, instead of assuming it is
~optimal. I'm guilty of this mistake myself, so not going to throw
stones.
Mateusz Guzik March 3, 2023, 7:38 p.m. UTC | #26
On 3/3/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
> On 3/3/23, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>> On Fri, Mar 3, 2023 at 9:39 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
>>>
>>> So the key is: memset is underperforming at least on x86-64 for
>>> certain sizes and the init-on-alloc thingy makes it used significantly
>>> more, exacerbating the problem
>>
>> One reason that the kernel memset isn't as optimized as memcpy, is
>> simply because under normal circumstances it shouldn't be *used* that
>> much outside of page clearing and constant-sized structure
>> initialization.
>>
>
> I mentioned in the previous e-mail that memset is used a lot even
> without the problematic opt and even have shown size distribution of
> what's getting passed there.
>
> You made me curious how usage compares to memcpy, so I checked 'make'
> in kernel dir once again.
>
> I stress the init-on-alloc opt is *OFF*:
> # CONFIG_INIT_ON_ALLOC_DEFAULT_ON is not set
> # CONFIG_INIT_ON_FREE_DEFAULT_ON is not set
>
> # bpftrace -e 'kprobe:memset,kprobe:memcpy { @[probe] = count(); }'
> [snip]
> @[kprobe:memcpy]: 6948498
> @[kprobe:memset]: 36150671
>
> iow memset is used about 7 times as much as memcpy when building the
> kernel. If it matters I'm building on top of
> 2eb29d59ddf02e39774abfb60b2030b0b7e27c1f (reasonably fresh master).
>
>> So this is literally a problem with pointless automated memset,
>> introduced by that hardening option. And hardening people generally
>> simply don't care about performance, and the people who _do _care
>> about performance usually don't enable the known-expensive crazy
>> stuff.
>>
>> Honestly, I think the INIT_ONCE stuff is actively detrimental, and
>> only hides issues (and in this case adds its own). So I can't but help
>> to say "don't do that then". I think it's literally stupid to clear
>> allocations by default.
>>
>
> Questioning usability of the entire thing was on the menu in what I
> intended to write, but I did not read entire public discussion so
> perhaps I missed a crucial insight.
>
>> I'm not opposed to improving memset, but honestly, if the argument is
>> based on the stupid hardening behavior, I really think *that* needs to
>> be fixed first.
>>
>
> It is not. The argument is that this is a core primitive, used a lot
> with sizes where rep stos is detrimental to its performance. There is
> no urgency, but eventually someone(tm) should sort it out. For
> $reasons I don't want to do it myself.
>
> I do bring it up in the context of the init-on-alloc machinery because
> it disfigures whatever results are obtained testing on x86-64 -- they
> can get exactly the same effect for much smaller cost, consequently
> they should have interest in sorting this out.
>
> General point though was that the work should have sanity-checked
> performance of the primitive it relies on, instead of assuming it is
> ~optimal. I'm guilty of this mistake myself, so not going to throw
> stones.
>

Forgot to paste the crapper I used to make both visible to bpftrace.

I'm guessing there is a sensible way, I just don't see it and would
love an instruction :)

diff --git a/arch/x86/lib/memcpy_64.S b/arch/x86/lib/memcpy_64.S
index a64017602010..c40084308d58 100644
--- a/arch/x86/lib/memcpy_64.S
+++ b/arch/x86/lib/memcpy_64.S
@@ -43,9 +43,6 @@ SYM_TYPED_FUNC_START(__memcpy)
 SYM_FUNC_END(__memcpy)
 EXPORT_SYMBOL(__memcpy)

-SYM_FUNC_ALIAS(memcpy, __memcpy)
-EXPORT_SYMBOL(memcpy)
-
 /*
  * memcpy_erms() - enhanced fast string memcpy. This is faster and
  * simpler than memcpy. Use memcpy_erms when possible.
diff --git a/arch/x86/lib/memset_64.S b/arch/x86/lib/memset_64.S
index 6143b1a6fa2c..141d899d5f1d 100644
--- a/arch/x86/lib/memset_64.S
+++ b/arch/x86/lib/memset_64.S
@@ -45,9 +45,6 @@ SYM_FUNC_START(__memset)
 SYM_FUNC_END(__memset)
 EXPORT_SYMBOL(__memset)

-SYM_FUNC_ALIAS(memset, __memset)
-EXPORT_SYMBOL(memset)
-
 /*
  * ISO C memset - set a memory block to a byte value. This function uses
  * enhanced rep stosb to override the fast string function.
diff --git a/fs/open.c b/fs/open.c
index 4401a73d4032..a3383391bd17 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1564,3 +1564,19 @@ int stream_open(struct inode *inode, struct file *filp)
 }

 EXPORT_SYMBOL(stream_open);
+
+void *(memset)(void *s, int c, size_t n)
+{
+       return __memset(s, c, n);
+}
+
+EXPORT_SYMBOL(memset);
+
+
+void *(memcpy)(void *d, const void *s, size_t n)
+{
+       return __memcpy(d, s, n);
+}
+
+EXPORT_SYMBOL(memcpy);
Linus Torvalds March 3, 2023, 8:08 p.m. UTC | #27
On Fri, Mar 3, 2023 at 11:37 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> I mentioned in the previous e-mail that memset is used a lot even
> without the problematic opt and even have shown size distribution of
> what's getting passed there.

Well, I *have* been pushing Intel to try to fix memcpy and memset for
over two decades by now, but with FSRM I haven't actually seen the
huge problems any more.

It may just be that the loads I look at don't have issues (or the
machines I've done profiles on don't tend to show them as much).

Hmm. Just re-did my usual kernel profile. It may also be that
something has changed. I do see "clear_page" at the top, but yes,
memset is higher up than I remembered.

I actually used to have the reverse of your hack for this - I've had
various hacks over the years that made memcpy and memset be inlined
"rep movs/stos", which (along with inlined spinlocks) is a great way
to see the _culprit_ (without having to deal with the call chains -
which always get done the wrong way around imnsho).

            Linus
Mateusz Guzik March 3, 2023, 8:39 p.m. UTC | #28
On 3/3/23, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Fri, Mar 3, 2023 at 11:37 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
>>
>> I mentioned in the previous e-mail that memset is used a lot even
>> without the problematic opt and even have shown size distribution of
>> what's getting passed there.
>
> Well, I *have* been pushing Intel to try to fix memcpy and memset for
> over two decades by now, but with FSRM I haven't actually seen the
> huge problems any more.
>

rep *stos* remains crap with FSRM, but I don't have sensible tests
handy nor access to the Ice Lake CPU I tested on at the moment

> I actually used to have the reverse of your hack for this - I've had
> various hacks over the year that made memcpy and memset be inlined
> "rep movs/stos", which (along with inlined spinlocks) is a great way
> to see the _culprit_ (without having to deal with the call chains -
> which always get done the wrong way around imnsho).
>

that's all hackery which made sense before tooling like bpftrace; people
can do better now (see the second part of the email)

I think there is a systemic problem which comes with the kzalloc API, consider:
static struct file *__alloc_file(int flags, const struct cred *cred)
{
	struct file *f;
	int error;

	f = kmem_cache_zalloc(filp_cachep, GFP_KERNEL);
	if (unlikely(!f))
		return ERR_PTR(-ENOMEM);
        [bunch of the struct gets initialized here]

the allocation routine does not have any information about the size
available at compile time, so it has to resort to a memset call at
runtime. Instead, should this be:

f = kmem_cache_alloc(...);
memset(f, 0, sizeof(*f));

... the compiler could in principle initialize stuff as indicated by
the code and emit a zerofill for the rest. Interestingly, last I checked
neither clang nor gcc knew how to do it; they instead resort to a full
sized memset anyway, which is quite a bummer.

Personally I grew up on dtrace; bpftrace I can barely use and don't
know how to specifically get the caller, but kstack(2) seems like a
good enough workaround.

as an example here is a one-liner to show crappers which do 0-sized ops:
bpftrace -e 'kprobe:memset,kprobe:memcpy /arg2 == 0/ { @[probe,
kstack(2)] = count(); }'

one can trace all kinds of degeneracy like that without recompiling
anything, provided funcs are visible to bpftrace

sample result from the above one-liner while doing 'make clean' in the
kernel dir:
@[kprobe:memcpy,
    memcpy+5
    realloc_array+78
]: 1
@[kprobe:memcpy,
    memcpy+5
    push_jmp_history+125
]: 1
@[kprobe:memset,
    memset+5
    blk_mq_dispatch_rq_list+687
]: 3
@[kprobe:memcpy,
    memcpy+5
    mix_interrupt_randomness+192
]: 4
@[kprobe:memcpy,
    memcpy+5
    d_alloc_pseudo+18
]: 59
@[kprobe:memcpy,
    memcpy+5
    add_device_randomness+111
]: 241
@[kprobe:memcpy,
    memcpy+5
    add_device_randomness+93
]: 527
@[kprobe:memset,
    memset+5
    copy_process+2904
]: 2054
@[kprobe:memset,
    memset+5
    dup_fd+283
]: 6162
Linus Torvalds March 3, 2023, 8:58 p.m. UTC | #29
On Fri, Mar 3, 2023 at 12:39 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> I think there is a systemic problem which comes with the kzalloc API

Well, it's not necessarily the API that is bad, but the implementation.

We could easily make kzalloc() with a constant size just expand to
kmalloc+memset, and get the behavior you want.

We already do magical things for "find the right slab bucket" part of
kmalloc too for constant sizes. It's changed over the years, but that
policy goes back a long long time. See

   https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=95203fe78007f9ab3aebb96606473ae18c00a5a8

from the BK history tree.

Exactly because some things are worth optimizing for when the size is
known at compile time.

Maybe just extending kzalloc() similarly? Trivial and entirely untested patch:

   --- a/include/linux/slab.h
   +++ b/include/linux/slab.h
   @@ -717,6 +717,12 @@ static inline void *kmem_cache_zalloc(struct
kmem_cache *k, gfp_t flags)
     */
    static inline __alloc_size(1) void *kzalloc(size_t size, gfp_t flags)
    {
   +    if (__builtin_constant_p(size)) {
   +            void *ret = kmalloc(size, flags);
   +            if (ret)
   +                    memset(ret, 0, size);
   +            return ret;
   +    }
        return kmalloc(size, flags | __GFP_ZERO);
    }

This may well be part of what has changed over the years. People have
done a *lot* of pseudo-automated "kmalloc+memset -> kzalloc" code
simplification. And in the process we've lost a lot of good
optimizations.

I used to do profiling religiously, but these days I only do it for
particular areas (usually just the walking part of pathname lookup)

             Linus
Mateusz Guzik March 3, 2023, 9:09 p.m. UTC | #30
On 3/3/23, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Fri, Mar 3, 2023 at 12:39 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>>
>> I think there is a systemic problem which comes with the kzalloc API
>
> Well, it's not necessarily the API that is bad, but the implementation.
>
> We could easily make kzalloc() with a constant size just expand to
> kmalloc+memset, and get the behavior you want.
>
> We already do magical things for "find the right slab bucket" part of
> kmalloc too for constant sizes. It's changed over the years, but that
> policy goes back a long long time. See
>
>
> https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=95203fe78007f9ab3aebb96606473ae18c00a5a8
>
> from the BK history tree.
>
> Exactly because some things are worth optimizing for when the size is
> known at compile time.
>
> Maybe just extending kzalloc() similarly? Trivial and entirely untested
> patch:
>
>    --- a/include/linux/slab.h
>    +++ b/include/linux/slab.h
>    @@ -717,6 +717,12 @@ static inline void *kmem_cache_zalloc(struct
> kmem_cache *k, gfp_t flags)
>      */
>     static inline __alloc_size(1) void *kzalloc(size_t size, gfp_t flags)
>     {
>    +    if (__builtin_constant_p(size)) {
>    +            void *ret = kmalloc(size, flags);
>    +            if (ret)
>    +                    memset(ret, 0, size);
>    +            return ret;
>    +    }
>         return kmalloc(size, flags | __GFP_ZERO);
>     }
>
> This may well be part of what has changed over the years. People have
> done a *lot* of pseudo-automated "kmalloc+memset -> kzalloc" code
> simplification. And in the process we've lost a lot of good
> optimizations.

I was about to write that kzalloc can use automagic treatment. I made
a change of similar sort years back in FreeBSD
https://cgit.freebsd.org/src/commit/?id=34c538c3560591a3856e85988b0b5eefdde53b0c

The crux of the comment though was not about kzalloc (another
brainfart, apologies), but kmem_cache_zalloc -- that one is kind of
screwed as is.

Perhaps it would be unscrewed if calls could be converted to something
along the lines of kmem_cache_zalloc_ptr(cachep, flags, returnobj);

so this from __alloc_file:
        struct file *f;
	int error;

	f = kmem_cache_zalloc(filp_cachep, GFP_KERNEL);
        if (unlikely(!f))
		return ERR_PTR(-ENOMEM);

could be:
	if (unlikely(!kmem_cache_zalloc_ptr(filp_cachep, GFP_KERNEL, f)))
		return ERR_PTR(-ENOMEM);

... where the macro rolls with similar treatment to the one you pasted
for kzalloc, and assigns to f.

If this sounds acceptable, coccinelle could be used to do a sweep.

I don't have a good name for it.
Linus Torvalds March 4, 2023, 1:29 a.m. UTC | #31
On Fri, Mar 3, 2023 at 12:39 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> as an example here is a one-liner to show crappers which do 0-sized ops:
> bpftrace -e 'kprobe:memset,kprobe:memcpy /arg2 == 0/ { @[probe,
> kstack(2)] = count(); }'

Looking not at 0-sized ops, but crazy small memset() ops, at least
some of these seem to have grown from the bitmap "optimizations".

In particular, 'cpumask_clear()' should just zero the cpumask, and on
the config I use, I have

    CONFIG_NR_CPUS=64

so it should literally just be a single "store zero to cpumask word".
And that's what it used to be.

But then we had commit aa47a7c215e7 ("lib/cpumask: deprecate
nr_cpumask_bits") and suddenly 'nr_cpumask_bits' isn't a simple
constant any more for the "small mask that fits on stack" case, and
instead you end up with code like

        movl    nr_cpu_ids(%rip), %edx
        addq    $63, %rdx
        shrq    $3, %rdx
        andl    $-8, %edx
        ..
        callq   memset@PLT

that does a 8-byte memset because I have 32 cores and 64 threads.

Now, at least some distro kernels seem to be built with CONFIG_MAXSMP,
so CONFIG_NR_CPUS is something insane (namely 8192), and then it is
indeed better to calculate some minimum size instead of doing a 1kB
memset().

But yes, it does look like we've had several regressions in various
areas, where we do insane "memset()" calls with variable size that
should never have been that. Both with kzalloc() and with cpumasks.

There are probably lots of other cases, I just happened to notice the
cpumask one when looking at "why is there a memset call in a function
that shouldn't have any at all".

I don't think this is a "rep stos" problem in general. Obviously it
might be in some cases, but I do think that the deeper problem is that
those small memset() calls generally should not have existed at all,
and were actual bugs. That cpumask_clear() should never have been a
function call in the first place for the "clear one or a couple of
words" case.
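
Roughly, for illustration (this is not the cpumask code, just the shape
of the problem):

/* constant size: the compiler turns this into a single 8-byte store */
struct mask64 { unsigned long bits[64 / BITS_PER_LONG]; };

static inline void mask64_clear(struct mask64 *m)
{
	memset(m, 0, sizeof(*m));
}

/* variable size: the length computation plus an out-of-line
 * memset()/rep-stos sequence is unavoidable */
static inline void maskvar_clear(unsigned long *bits, unsigned int nbits)
{
	memset(bits, 0, BITS_TO_LONGS(nbits) * sizeof(long));
}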

The kernel very seldom acts on byte data, so a memset() or memcpy()
that looks like individual bytes is some very odd stuff. We have lots
of often fairly small structures with fixed sizes, though. And maybe
there are other cases like that bogus "let's try to optimize the
bitmap range"...

Of course, then we *do* have distros that do actively detrimental
things. You have Ubuntu with that CONFIG_INIT_ON_ALLOC_DEFAULT_ON, and
if I didn't just rebuild my kernel with sane config options, I'd have
Fedora with CONFIG_MAXSMP that is meant to be a "check worst case"
option, not something you *use*.

Oh well.

          Linus
Yury Norov March 4, 2023, 3:25 a.m. UTC | #32
[...]

> In particular, 'cpumask_clear()' should just zero the cpumask, and on
> the config I use, I have
> 
>     CONFIG_NR_CPUS=64
> 
> so it should literally just be a single "store zero to cpumask word".
> And that's what it used to be.
> 
> But then we had commit aa47a7c215e7 ("lib/cpumask: deprecate
> nr_cpumask_bits") and suddenly 'nr_cpumask_bits' isn't a simple
> constant any more for the "small mask that fits on stack" case, and
> instead you end up with code like
> 
>         movl    nr_cpu_ids(%rip), %edx
>         addq    $63, %rdx
>         shrq    $3, %rdx
>         andl    $-8, %edx
>         ..
>         callq   memset@PLT
> 
> that does a 8-byte memset because I have 32 cores and 64 threads.

Did you enable CONFIG_FORCE_NR_CPUS? If you pick it, the kernel will
bind nr_cpu_ids to NR_CPUS at compile time, and the memset() call
should disappear.
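
(Simplified sketch of the mechanism, not the exact header text:)

#if defined(CONFIG_FORCE_NR_CPUS)
#define nr_cpu_ids ((unsigned int)NR_CPUS)	/* compile-time constant */
#else
extern unsigned int nr_cpu_ids;			/* set at boot time */
#endif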

Depending on your compiler you might want to apply this patch as well:

https://lore.kernel.org/lkml/20221027043810.350460-2-yury.norov@gmail.com/

> Now, at least some distro kernels seem to be built with CONFIG_MAXSMP,
> so CONFIG_NR_CPUS is something insane (namely 8192), and then it is
> indeed better to calculate some minimum size instead of doing a 1kB
> memset().

Ubuntu too. That was one of the reasons for the patch.

Thanks,
Yury
Linus Torvalds March 4, 2023, 3:42 a.m. UTC | #33
On Fri, Mar 3, 2023 at 7:25 PM Yury Norov <yury.norov@gmail.com> wrote:
>
> Did you enable CONFIG_FORCE_NR_CPUS? If you pick it, the kernel will
> bind nr_cpu_ids to NR_CPUS at compile time, and the memset() call
> should disappear.

I do not believe CONFIG_FORCE_NR_CPUS makes any sense, and I think I
told you so at the time.

This all used to just work *without* some kind of config thing. First
removing the automatic "do the right thing", and then adding a config
option to "force" doing the right thing, seems more than a bit silly to
me.

I think CONFIG_FORCE_NR_CPUS should go away, and - once more - become
just the "is the cpumask small enough to be just allocated directly"
thing.

Of course, the problem for others remains that distros will do that
CONFIG_CPUMASK_OFFSTACK thing, and then things will suck regardless.

I was *so* happy with our clever "you can have large cpumasks, and
we'll just allocate them off the stack" long long ago, because it
meant that we could have one single source tree where this was all
cleanly abstracted away, and we even had nice types and type safety
for it all.

That meant that we could support all the fancy SGI machines with
several thousand cores, and it all "JustWorked(tm)", and didn't make
the normal case any worse.

I didn't expect distros to then go "ooh, we want that too", and enable
it all by default, and make all our clever "you only see this
indirection if you need it" go away, and now the normal case is the
*bad* case, unless you just build your own kernel and pick sane
defaults.

Oh well.

                   Linus
Yury Norov March 4, 2023, 5:51 a.m. UTC | #34
On Fri, Mar 03, 2023 at 07:42:36PM -0800, Linus Torvalds wrote:
> On Fri, Mar 3, 2023 at 7:25 PM Yury Norov <yury.norov@gmail.com> wrote:
> >
> > Did you enable CONFIG_FORCE_NR_CPUS? If you pick it, the kernel will
> > bind nr_cpu_ids to NR_CPUS at compile time, and the memset() call
> > should disappear.
> 
> I do not believe CONFIG_FORCE_NR_CPUS makes any sense, and I think I
> told you so at the time.

At that time you were OK with CONFIG_FORCE_NR_CPUS, and only suggested
to hide it behind CONFIG_EXPERT:

https://lore.kernel.org/all/Yzx4fSmmr8bh6gdl@yury-laptop/T/#m92d405527636154c3b2000e0105379170d988315
 
> This all used to just work *without* some kind of config thing, First
> removing the automatic "do the right thing", and then adding a config
> option to "force" doing the right thing seems more than a bit silly to
> me.
> 
> I think CONFIG_FORCE_NR_CPUS should go away, and - once more - become
> just the "is the cpumask small enough to be just allocated directly"
> thing.

This all was just broken. For example, as I mentioned in the commit
message, cpumask_full() was broken. I know because I wrote a test. There
was not a single user of the function, and nobody complained. Now we have
one in BPF code. So if we simply revert aa47a7c215e, it will hurt real
users.

The pre-CONFIG_FORCE_NR_CPUS cpumask machinery would work only if you
set NR_CPUS to the number that matches the actual number of CPUs as
detected at boot time.

In your example, if you have NR_CPUS == 64, and for some reason disable
hyper threading, nr_cpumask_bits will be set to 64 at compile time, but
nr_cpu_ids will be set to 32 at boot time, assuming
CONFIG_CPUMASK_OFFSTACK is disabled.

And the following code will be broken:

cpumask_t m1, m2;

cpumask_setall(&m1); // m1 is ffff ffff ffff ffff because it uses
                     // compile-time optimized nr_cpumask_bits

for_each_cpu(cpu, &m1) // 32 iterations because it relied on nr_cpu_ids
        cpumask_set_cpu(cpu, &m2); // m2 is ffff ffff XXXX XXXX

BUG_ON(!cpumask_equal(&m1, &m2)); // Bug because it will test all 64 bits

Today with CONFIG_FORCE_NR_CPUS disabled, the kernel consistently relies
on the boot-time-defined nr_cpu_ids in functions like cpumask_equal(),
at the cost of giving up the compile-time optimizations.

If CONFIG_FORCE_NR_CPUS is enabled, it wires nr_cpu_ids to NR_CPUS
at compile time, which allows compile-time optimization.

If CONFIG_FORCE_NR_CPUS is enabled, but actual number of CPUs doesn't
match to NR_CPUS, the kernel throws a warning at boot time - better
than nothing.

I'm not happy bothering people with a new config parameter in such a
simple case. I just don't know how to fix it better. Is there a safe
way to teach the compiler to optimize against NR_CPUS other than telling
it explicitly?

> Of course, the problem for others remain that distros will do that
> CONFIG_CPUMASK_OFFSTACK thing, and then things will suck regardless.
> 
> I was *so* happy with our clever "you can have large cpumasks, and
> we'll just allocate them off the stack" long long ago, because it
> meant that we could have one single source tree where this was all
> cleanly abstracted away, and we even had nice types and type safety
> for it all.
> 
> That meant that we could support all the fancy SGI machines with
> several thousand cores, and it all "JustWorked(tm)", and didn't make
> the normal case any worse.
> 
> I didn't expect distros to then go "ooh, we want that too", and enable
> it all by default, and make all our clever "you only see this
> indirection if you need it" go away, and now the normal case is the
> *bad* case, unless you just build your own kernel and pick sane
> defaults.
>
> Oh well.

From distro people's perspective, 'one size fits all' is the best
approach. It's hard to blame them.

Thanks,
Yury
David Vernet March 4, 2023, 4:41 p.m. UTC | #35
On Fri, Mar 03, 2023 at 09:51:41PM -0800, Yury Norov wrote:
> On Fri, Mar 03, 2023 at 07:42:36PM -0800, Linus Torvalds wrote:
> > On Fri, Mar 3, 2023 at 7:25 PM Yury Norov <yury.norov@gmail.com> wrote:
> > >
> > > Did you enable CONFIG_FORCE_NR_CPUS? If you pick it, the kernel will
> > > bind nr_cpu_ids to NR_CPUS at compile time, and the memset() call
> > > should disappear.
> > 
> > I do not believe CONFIG_FORCE_NR_CPUS makes any sense, and I think I
> > told you so at the time.
> 
> At that time you was OK with CONFIG_FORCE_NR_CPUS, only suggested to
> hide it behind CONFIG_EXPERT:
> 
> https://lore.kernel.org/all/Yzx4fSmmr8bh6gdl@yury-laptop/T/#m92d405527636154c3b2000e0105379170d988315
>  
> > This all used to just work *without* some kind of config thing, First
> > removing the automatic "do the right thing", and then adding a config
> > option to "force" doing the right thing seems more than a bit silly to
> > me.
> > 
> > I think CONFIG_FORCE_NR_CPUS should go away, and - once more - become
> > just the "is the cpumask small enough to be just allocated directly"
> > thing.
> 
> This all was just broken. For example, as I mentioned in commit message,
> cpumask_full() was broken. I know because I wrote a test. There were no
> a single user for the function, and nobody complained. Now we have one
> in BPF code. So if we simply revert the aa47a7c215e, it will hurt real
> users.

FWIW, we can remove bpf_cpumask_full() and any other affected
bpf_cpumask kfuncs if we need to. kfuncs have no strict stability
guarantees (they're kernel symbols meant for use by kernel programs, see
[0]), so we can remove them without worrying about user space breakages
or stability issues. They were also added relatively recently, and as
far as I know, Tejun and I are thus far the only people using them.

[0]: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/Documentation/bpf/kfuncs.rst#n350

Of course, I'd prefer that we didn't remove any of them, but if we have
to then BPF won't get in the way. As you pointed out below, there are
scenarios beyond cpumask_full() where (many) non-BPF callers could trip
on this as well, so IMO what we have today makes sense (assuming there's
not some other clever way to correctly optimize for NR_CPUS without the
extra config option, as you also said below).

Thanks,
David

> The pre-CONFIG_FORCE_NR_CPUS cpumask machinery would work only if you
> set NR_CPUS to the number that matches to the actual number of CPUs as
> detected at boot time.
> 
> In your example, if you have NR_CPUS == 64, and for some reason disable
> hyper threading, nr_cpumask_bits will be set to 64 at compile time, but
> nr_cpu_ids will be set to 32 at boot time, assuming
> CONFIG_CPUMASK_OFFSTACK is disabled.
> 
> And the following code will be broken:
> 
> cpumask_t m1, m2;
> 
> cpumask_setall(m1); // m1 is ffff ffff ffff ffff because it uses
>                     // compile-time optimized nr_cpumask_bits
> 
> for_each_cpu(cpu, m1) // 32 iterations because it relied on nr_cpu_ids
>         cpumask_set_cpu(cpu, m2); // m2 is ffff ffff XXXX XXXX
> 
> BUG_ON(!cpumask_equal(m1, m2)); // Bug because it will test all 64 bits
>
> Today with CONFIG_FORCE_NR_CPUS disabled, kernel consistently relies
> on boot-time defined nr_cpu_ids in functions like cpumask_equal()
> with the cost of disabled runtime optimizations.
> 
> If CONFIG_FORCE_NR_CPUS is enabled, it wires nr_cpu_ids to NR_CPUS
> at compile time, which allows compile-time optimization.
> 
> If CONFIG_FORCE_NR_CPUS is enabled, but actual number of CPUs doesn't
> match to NR_CPUS, the kernel throws a warning at boot time - better
> than nothing.
> 
> I'm not happy bothering people with a new config parameter in such a
> simple case. I just don't know how to fix it better. Is there a safe
> way to teach compiler to optimize against NR_CPUS other than telling
> it explicitly?
> 
> > Of course, the problem for others remain that distros will do that
> > CONFIG_CPUMASK_OFFSTACK thing, and then things will suck regardless.
> > 
> > I was *so* happy with our clever "you can have large cpumasks, and
> > we'll just allocate them off the stack" long long ago, because it
> > meant that we could have one single source tree where this was all
> > cleanly abstracted away, and we even had nice types and type safety
> > for it all.
> > 
> > That meant that we could support all the fancy SGI machines with
> > several thousand cores, and it all "JustWorked(tm)", and didn't make
> > the normal case any worse.
> > 
> > I didn't expect distros to then go "ooh, we want that too", and enable
> > it all by default, and make all our clever "you only see this
> > indirection if you need it" go away, and now the normal case is the
> > *bad* case, unless you just build your own kernel and pick sane
> > defaults.
> >
> > Oh well.
> 
> From distro people's perspective, 'one size fits all' is the best
> approach. It's hard to blame them.
> 
> Thanks,
> Yury
Mateusz Guzik March 4, 2023, 7:01 p.m. UTC | #36
On 3/3/23, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Fri, Mar 3, 2023 at 12:39 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>>
>> I think there is a systemic problem which comes with the kzalloc API
>
> Well, it's not necessarily the API that is bad, but the implementation.
>
> We could easily make kzalloc() with a constant size just expand to
> kmalloc+memset, and get the behavior you want.
>
> We already do magical things for "find the right slab bucket" part of
> kmalloc too for constant sizes. It's changed over the years, but that
> policy goes back a long long time. See
>
>
> https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=95203fe78007f9ab3aebb96606473ae18c00a5a8
>
> from the BK history tree.
>
> Exactly because some things are worth optimizing for when the size is
> known at compile time.
>
> Maybe just extending kzalloc() similarly? Trivial and entirely untested
> patch:
>
>    --- a/include/linux/slab.h
>    +++ b/include/linux/slab.h
>    @@ -717,6 +717,12 @@ static inline void *kmem_cache_zalloc(struct
> kmem_cache *k, gfp_t flags)
>      */
>     static inline __alloc_size(1) void *kzalloc(size_t size, gfp_t flags)
>     {
>    +    if (__builtin_constant_p(size)) {
>    +            void *ret = kmalloc(size, flags);
>    +            if (ret)
>    +                    memset(ret, 0, size);
>    +            return ret;
>    +    }
>         return kmalloc(size, flags | __GFP_ZERO);
>     }
>

So I played with this and have a rather nasty summary. Bullet points:
1. patched kzalloc does not reduce memsets calls during kernel build
2. patched kmem_cache_zalloc_ptr + 2 consumers converted *does* drop
it significantly (36150671 -> 14414454)
3. ... inline memset generated by gcc sucks by resorting to rep stosq
around 48 bytes
4. memsets not sorted out have sizes not known at compilation time and
are not necessarily perf bugs on their own [read: would benefit from
faster memset]

Onto the meat:

I patched the kernel with a slightly tweaked version of the above:
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 45af70315a94..7abb5490690f 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -717,6 +717,12 @@ static inline void *kmem_cache_zalloc(struct
kmem_cache *k, gfp_t flags)
  */
 static inline __alloc_size(1) void *kzalloc(size_t size, gfp_t flags)
 {
+       if (__builtin_constant_p(size)) {
+               void *ret = kmalloc(size, flags);
+               if (likely(ret))
+                       memset(ret, 0, size);
+               return ret;
+       }
        return kmalloc(size, flags | __GFP_ZERO);
 }

and verified it indeed zeroes inline:

void kztest(void)
{
        void *ptr;

        ptr = kzalloc(32, GFP_KERNEL);
        if (unlikely(!ptr))
                return;
        memsettest_rec(ptr);
}

$ objdump --disassemble=kztest vmlinux
[snip]
call   ffffffff8135e130 <kmalloc_trace>
test   %rax,%rax
je     ffffffff81447d5f <kztest+0x4f>
movq   $0x0,(%rax)
mov    %rax,%rdi
movq   $0x0,0x8(%rax)
movq   $0x0,0x10(%rax)
movq   $0x0,0x18(%rax)
call   ffffffff81454060 <memsettest_rec>
[snip]

This did *NOT* lead to a reduction of memset calls when building the kernel.

I verified a few cases by hand; it is all either kmem_cache_zalloc or
explicitly added memsets with sizes not known at compilation time.

Two most frequent callers:
@[
    memset+5
    __alloc_file+40
    alloc_empty_file+73
    path_openat+77
    do_filp_open+182
    do_sys_openat2+159
    __x64_sys_openat+89
    do_syscall_64+93
    entry_SYSCALL_64_after_hwframe+114
]: 11028994
@[
    memset+5
    security_file_alloc+45
    __alloc_file+89
    alloc_empty_file+73
    path_openat+77
    do_filp_open+182
    do_sys_openat2+159
    __x64_sys_openat+89
    do_syscall_64+93
    entry_SYSCALL_64_after_hwframe+114
]: 11028994

My wip addition is:

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 45af70315a94..12b5b02ef3d3 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -710,6 +710,17 @@ static inline void *kmem_cache_zalloc(struct
kmem_cache *k, gfp_t flags)
        return kmem_cache_alloc(k, flags | __GFP_ZERO);
 }

+#define kmem_cache_zalloc_ptr(k, f, retp) ({                           \
+       __typeof(retp) _retp = kmem_cache_alloc(k, f);                  \
+       bool _rv = false;                                               \
+       retp = _retp;                                                   \
+       if (likely(_retp)) {                                            \
+               memset(_retp, 0, sizeof(*_retp));                       \
+               _rv = true;                                             \
+       }                                                               \
+       _rv;                                                            \
+})
+
diff --git a/security/security.c b/security/security.c
index cf6cc576736f..0f769ede0e54 100644
--- a/security/security.c
+++ b/security/security.c
@@ -600,8 +600,7 @@ static int lsm_file_alloc(struct file *file)
                return 0;
        }

-       file->f_security = kmem_cache_zalloc(lsm_file_cache, GFP_KERNEL);
-       if (file->f_security == NULL)
+       if (!kmem_cache_zalloc_ptr(lsm_file_cache, GFP_KERNEL,
file->f_security))
                return -ENOMEM;
        return 0;
 }
diff --git a/fs/file_table.c b/fs/file_table.c
index 372653b92617..8e0dabf9530e 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -136,8 +136,7 @@ static struct file *__alloc_file(int flags, const
struct cred *cred)
        struct file *f;
        int error;

-       f = kmem_cache_zalloc(filp_cachep, GFP_KERNEL);
-       if (unlikely(!f))
+       if (!kmem_cache_zalloc_ptr(filp_cachep, GFP_KERNEL, f))
                return ERR_PTR(-ENOMEM);

        f->f_cred = get_cred(cred);

As mentioned above, it cuts total calls by more than half.

The problem is it rolls with rep stosq way too easily, partially
defeating the point of inlining anything. clang does not have this
problem.

Take a look at __alloc_file:
[snip]
mov    0x19cab05(%rip),%rdi        # ffffffff82df4318 <filp_cachep>
call   ffffffff813dd610 <kmem_cache_alloc>
test   %rax,%rax
je     ffffffff814298b7 <__alloc_file+0xc7>
mov    %rax,%r12
mov    $0x1d,%ecx
xor    %eax,%eax
mov    %r12,%rdi
rep stos %rax,%es:(%rdi)
[/snip]

Here is a sample consumer which can't help but have a variable size --
select, used by gmake:
@[
    memset+5
    do_pselect.constprop.0+202
    __x64_sys_pselect6+101
    do_syscall_64+93
    entry_SYSCALL_64_after_hwframe+114
]: 13160

In conclusion:
1. fixing up memsets is worthwhile regardless of what happens to its
current consumers -- not all of them are necessarily doing something
wrong
2. inlining memset can be a pessimization vs non-plain-erms memset as
evidenced above. will have to figure out how to convince gcc to be
less eager to use it.

Sometimes I hate computers.
Linus Torvalds March 4, 2023, 7:02 p.m. UTC | #37
On Fri, Mar 3, 2023 at 9:51 PM Yury Norov <yury.norov@gmail.com> wrote:
>
> At that time you was OK with CONFIG_FORCE_NR_CPUS, only suggested to
> hide it behind CONFIG_EXPERT:

I think there was a mis-communication.

I was violently *not* ok with that question at all. I think our Kconfig
phase is really annoying and nasty, and we should not ask people
questions they don't know the meaning of.

So putting it behind EXPERT was a "at that point, the question is
gone", and I'm ok with the config variable existing.

But..

I am *not* ok with you then thinking that "oh, the config variable
exists, so our default code generation can be complete crap".

The kernel should do well by default. No amount of "but you could go
into magic config variables and then force options that might be ok
for embedded systems to make it generate ok code".

I think it's completely crazy that the distros enable MAXSMP. But
that's their choice. A *sane* distro should not do that, and then we
limit the normal kernel to something sane like a couple of hundreds of
CPUs rather than thousands of them (and the associated huge overhead).

At that point, things like cpumask_clear() should be a couple of stores
- not a call-out to a variable-sized memset().

Notice? Simple and understandable Kconfig questions like "do you
*really* need to support thousands of CPU's"

Not that FORCE_NR_CPUS question, which is _completely_ useless to any
distribution and thus completely unacceptable as a regular question.

See the difference?

          Linus
Linus Torvalds March 4, 2023, 7:19 p.m. UTC | #38
On Fri, Mar 3, 2023 at 9:51 PM Yury Norov <yury.norov@gmail.com> wrote:
>
> And the following code will be broken:
>
> cpumask_t m1, m2;
>
> cpumask_setall(m1); // m1 is ffff ffff ffff ffff because it uses
>                     // compile-time optimized nr_cpumask_bits
>
> for_each_cpu(cpu, m1) // 32 iterations because it relied on nr_cpu_ids
>         cpumask_set_cpu(cpu, m2); // m2 is ffff ffff XXXX XXXX

So  honestly, it looks like you picked an example of something very
unusual to then make everything else slower.

Rather than commit aa47a7c215e7, we should just have fixed 'setall()'
and 'for_each_cpu()' to use nr_cpu_ids, and then the rest would
continue to use nr_cpumask_bits.

That particular code sequence is arguably broken to begin with.
setall() should really only be used as a mask, most definitely not as
some kind of "all possible cpus".

The latter is "cpu_possible_mask", which is very different indeed (and
often what you want is "cpu_online_mask")

But I'd certainly be ok with using nr_cpu_ids for setall, partly
exactly because it's so rare. It would probably be better to remove it
entirely, but whatever.

              Linus
Al Viro March 4, 2023, 8:18 p.m. UTC | #39
On Fri, Mar 03, 2023 at 09:39:11PM +0100, Mateusz Guzik wrote:

> the allocation routine does not have any information about the size
> available at compilation time, so has to resort to a memset call at
> runtime. Instead, should this be:
> 
> f = kmem_cache_alloc(...);
> memset(f, 0, sizeof(*f));
> 
> ... the compiler could in principle inititalize stuff as indicated by
> code and emit zerofill for the rest. Interestingly, last I checked
> neither clang nor gcc knew how to do it, they instead resort to a full
> sized memset anyway, which is quite a bummer.

For struct file I wouldn't expect a win from that, TBH.
Mateusz Guzik March 4, 2023, 8:31 p.m. UTC | #40
On 3/4/23, Mateusz Guzik <mjguzik@gmail.com> wrote:
> On 3/3/23, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>> On Fri, Mar 3, 2023 at 12:39 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>>>
>>> I think there is a systemic problem which comes with the kzalloc API
>>
>> Well, it's not necessarily the API that is bad, but the implementation.
>>
>> We could easily make kzalloc() with a constant size just expand to
>> kmalloc+memset, and get the behavior you want.
>>
>> We already do magical things for "find the right slab bucket" part of
>> kmalloc too for constant sizes. It's changed over the years, but that
>> policy goes back a long long time. See
>>
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git/commit/?id=95203fe78007f9ab3aebb96606473ae18c00a5a8
>>
>> from the BK history tree.
>>
>> Exactly because some things are worth optimizing for when the size is
>> known at compile time.
>>
>> Maybe just extending kzalloc() similarly? Trivial and entirely untested
>> patch:
>>
>>    --- a/include/linux/slab.h
>>    +++ b/include/linux/slab.h
>>    @@ -717,6 +717,12 @@ static inline void *kmem_cache_zalloc(struct
>> kmem_cache *k, gfp_t flags)
>>      */
>>     static inline __alloc_size(1) void *kzalloc(size_t size, gfp_t flags)
>>     {
>>    +    if (__builtin_constant_p(size)) {
>>    +            void *ret = kmalloc(size, flags);
>>    +            if (ret)
>>    +                    memset(ret, 0, size);
>>    +            return ret;
>>    +    }
>>         return kmalloc(size, flags | __GFP_ZERO);
>>     }
>>
>
> So I played with this and have a rather nasty summary. Bullet points:
> 1. patched kzalloc does not reduce memsets calls during kernel build
> 2. patched kmem_cache_zalloc_ptr + 2 consumers converted *does* drop
> it significantly (36150671 -> 14414454)
> 3. ... inline memset generated by gcc sucks by resorting to rep stosq
> around 48 bytes
> 4. memsets not sorted out have sizes not known at compilation time and
> are not necessarily perf bugs on their own [read: would benefit from
> faster memset]
>
> Onto the the meat:
>
> I patched the kernel with a slightly tweaked version of the above:
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 45af70315a94..7abb5490690f 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -717,6 +717,12 @@ static inline void *kmem_cache_zalloc(struct
> kmem_cache *k, gfp_t flags)
>   */
>  static inline __alloc_size(1) void *kzalloc(size_t size, gfp_t flags)
>  {
> +       if (__builtin_constant_p(size)) {
> +               void *ret = kmalloc(size, flags);
> +               if (likely(ret))
> +                       memset(ret, 0, size);
> +               return ret;
> +       }
>         return kmalloc(size, flags | __GFP_ZERO);
>  }
>
> and verified it indeed zeroes inline:
>
> void kztest(void)
> {
>         void *ptr;
>
>         ptr = kzalloc(32, GFP_KERNEL);
>         if (unlikely(!ptr))
>                 return;
>         memsettest_rec(ptr);
> }
>
> $ objdump --disassemble=kztest vmlinux
> [snip]
> call   ffffffff8135e130 <kmalloc_trace>
> test   %rax,%rax
> je     ffffffff81447d5f <kztest+0x4f>
> movq   $0x0,(%rax)
> mov    %rax,%rdi
> movq   $0x0,0x8(%rax)
> movq   $0x0,0x10(%rax)
> movq   $0x0,0x18(%rax)
> call   ffffffff81454060 <memsettest_rec>
> [snip]
>
> This did *NOT* lead to reduction of memset calls when building the kernel.
>
> I verified few cases by hand, it is all either kmem_cache_zalloc or
> explicitly added memsets with sizes not known at compilation time.
>
> Two most frequent callers:
> @[
>     memset+5
>     __alloc_file+40
>     alloc_empty_file+73
>     path_openat+77
>     do_filp_open+182
>     do_sys_openat2+159
>     __x64_sys_openat+89
>     do_syscall_64+93
>     entry_SYSCALL_64_after_hwframe+114
> ]: 11028994
> @[
>     memset+5
>     security_file_alloc+45
>     __alloc_file+89
>     alloc_empty_file+73
>     path_openat+77
>     do_filp_open+182
>     do_sys_openat2+159
>     __x64_sys_openat+89
>     do_syscall_64+93
>     entry_SYSCALL_64_after_hwframe+114
> ]: 11028994
>
> My wip addition is:
>
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 45af70315a94..12b5b02ef3d3 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -710,6 +710,17 @@ static inline void *kmem_cache_zalloc(struct
> kmem_cache *k, gfp_t flags)
>         return kmem_cache_alloc(k, flags | __GFP_ZERO);
>  }
>
> +#define kmem_cache_zalloc_ptr(k, f, retp) ({                           \
> +       __typeof(retp) _retp = kmem_cache_alloc(k, f);                  \
> +       bool _rv = false;                                               \
> +       retp = _retp;                                                   \
> +       if (likely(_retp)) {                                            \
> +               memset(_retp, 0, sizeof(*_retp));                       \
> +               _rv = true;                                             \
> +       }                                                               \
> +       _rv;                                                            \
> +})
> +
> diff --git a/security/security.c b/security/security.c
> index cf6cc576736f..0f769ede0e54 100644
> --- a/security/security.c
> +++ b/security/security.c
> @@ -600,8 +600,7 @@ static int lsm_file_alloc(struct file *file)
>                 return 0;
>         }
>
> -       file->f_security = kmem_cache_zalloc(lsm_file_cache, GFP_KERNEL);
> -       if (file->f_security == NULL)
> +       if (!kmem_cache_zalloc_ptr(lsm_file_cache, GFP_KERNEL,
> file->f_security))
>                 return -ENOMEM;
>         return 0;
>  }

This one is actually buggy -- f_security is a void * pointer and
sizeof(*file->f_security) returns just 1. The macro does not have any
safety belts against that -- it should probably check for void * at
compilation time and get a BUG_ON for runtime mismatch. Does not
affect the idea though.
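
Something along these lines could serve as the compile-time belt
(untested sketch bolted onto the macro from the previous mail):

#define kmem_cache_zalloc_ptr(k, f, retp) ({                            \
        __typeof(retp) _retp;                                           \
        /* refuse void * targets, where sizeof(*retp) degenerates to 1 */ \
        BUILD_BUG_ON(__builtin_types_compatible_p(__typeof(retp), void *)); \
        _retp = kmem_cache_alloc(k, f);                                 \
        retp = _retp;                                                   \
        if (likely(_retp))                                              \
                memset(_retp, 0, sizeof(*_retp));                       \
        _retp != NULL;                                                  \
})

A runtime check against kmem_cache_size(k) could catch size mismatches
the type check cannot, at the cost of a branch.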

Good news: gcc provides a lot of control as to how it inlines string
ops, most notably:
       -mstringop-strategy=alg
           Override the internal decision heuristic for the particular
algorithm to use for inlining string
           operations.  The allowed values for alg are:

           rep_byte
           rep_4byte
           rep_8byte
               Expand using i386 "rep" prefix of the specified size.

           byte_loop
           loop
           unrolled_loop
               Expand into an inline loop.

           libcall
               Always use a library call.

I'm going to play with it and send something more presentable.

> diff --git a/fs/file_table.c b/fs/file_table.c
> index 372653b92617..8e0dabf9530e 100644
> --- a/fs/file_table.c
> +++ b/fs/file_table.c
> @@ -136,8 +136,7 @@ static struct file *__alloc_file(int flags, const
> struct cred *cred)
>         struct file *f;
>         int error;
>
> -       f = kmem_cache_zalloc(filp_cachep, GFP_KERNEL);
> -       if (unlikely(!f))
> +       if (!kmem_cache_zalloc_ptr(filp_cachep, GFP_KERNEL, f))
>                 return ERR_PTR(-ENOMEM);
>
>         f->f_cred = get_cred(cred);
>
> As mentioned above it cuts total calls in more than half.
>
> The problem is it is it rolls with rep stosq way too easily, partially
> defeating the point of inlining anything. clang does not have this
> problem.
>
> Take a look at __alloc_file:
> [snip]
> mov    0x19cab05(%rip),%rdi        # ffffffff82df4318 <filp_cachep>
> call   ffffffff813dd610 <kmem_cache_alloc>
> test   %rax,%rax
> je     ffffffff814298b7 <__alloc_file+0xc7>
> mov    %rax,%r12
> mov    $0x1d,%ecx
> xor    %eax,%eax
> mov    %r12,%rdi
> rep stos %rax,%es:(%rdi)
> [/snip]
>
> Here is a sample consumer which can't help but have a variable size --
> select, used by gmake:
> @[
>     memset+5
>     do_pselect.constprop.0+202
>     __x64_sys_pselect6+101
>     do_syscall_64+93
>     entry_SYSCALL_64_after_hwframe+114
> ]: 13160
>
> In conclusion:
> 1. fixing up memsets is worthwhile regardless of what happens to its
> current consumers -- not all of them are necessarily doing something
> wrong
> 2. inlining memset can be a pessimization vs non-plain-erms memset as
> evidenced above. will have to figure out how to convince gcc to be
> less eager to use it.
>
> Sometimes I hate computers.
>
> --
> Mateusz Guzik <mjguzik gmail.com>
>
Linus Torvalds March 4, 2023, 8:34 p.m. UTC | #41
On Sat, Mar 4, 2023 at 11:19 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Rather than commit aa47a7c215e7, we should just have fixed 'setall()'
> and 'for_each_cpu()' to use nr_cpu_ids, and then the rest would
> continue to use nr_cpumask_bits.

Looking around, there's a few more "might set high bits in the
bitmask" cases - notably cpumask_complement(). It uses
bitmap_complement(), which in turn can set bits past the end of the
bit range.

Of course, that function is then used in exactly one place (ia64 ACPI):

        cpumask_complement(&tmp_map, cpu_present_mask);
        cpu = cpumask_first(&tmp_map);

it's basically a nasty way or writing

        cpu = cpumask_first_zero(cpu_present_mask);

so the "fix" is to just do that cleanup, and get rid of
"cpumask_complement()" entirely.

So I would suggest we simply do something like the attached patch.

NOTE! This is *entirely* untested. See the _intent_ behind the patch
in the big comment above the 'nr_cpumask_bits' #define.

So this patch is more of a "maybe something like this?"

And no, nothing here helps the MAXSMP case. I don't think it's
entirely unfixable, but it's close. Some very involved static jump
infrastructure *might* make the MAXSMP case be something we could
generate good code for too, but the whole "we potentially have
thousands of CPUs" case really shouldn't have ever been used on normal
machines.

It is what it is. I think the best we can do is to try to generate
good code for a distribution that cares about good code. Once the
distro maintainers go "let's enable MAXSMP even though the kernel
Kconfig help file tells us not to", there's very little we as kernel
maintainers can do.

                   Linus
Mateusz Guzik March 4, 2023, 8:42 p.m. UTC | #42
On 3/4/23, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Fri, Mar 03, 2023 at 09:39:11PM +0100, Mateusz Guzik wrote:
>
>> the allocation routine does not have any information about the size
>> available at compilation time, so has to resort to a memset call at
>> runtime. Instead, should this be:
>>
>> f = kmem_cache_alloc(...);
>> memset(f, 0, sizeof(*f));
>>
>> ... the compiler could in principle inititalize stuff as indicated by
>> code and emit zerofill for the rest. Interestingly, last I checked
>> neither clang nor gcc knew how to do it, they instead resort to a full
>> sized memset anyway, which is quite a bummer.
>
> For struct file I wouldn't expect a win from that, TBH.
>

That was mostly for illustrative purposes, but you are right -- turns
out the slab object is 256 bytes in size and only a small fraction of
it is initialized in the allocation routine. Good candidate to always
punt to memset in the allocator as it happens now.

Bummer though, it is also one of the 2 most memset'ed caches during a
kernel build. The other one allocated at the same rate is lsm_file_cache
and that's only 32 bytes in size, so it will get a win.
Linus Torvalds March 4, 2023, 8:48 p.m. UTC | #43
On Sat, Mar 4, 2023 at 12:31 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> Good news: gcc provides a lot of control as to how it inlines string
> ops, most notably:
>        -mstringop-strategy=alg

Note that any static decision is always going to be crap somewhere.
You can make it do the "optimal" thing for any particular machine, but
I consider that to be just garbage.

What I would actually like to see is the compiler always generate an
out-of-line call for the "big enough to not just do inline trivially"
case, but do so with the "rep stosb/movsb" calling convention.

Then we'd just mark those with objdump, and patch it up dynamically to
either use the right out-of-line memset/memcpy function, *or* just
replace it entirely with 'rep stosb' inline.
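
Something of this rough shape, just to illustrate the idea (the names
and the feature bit are made up, and the out-of-line helper would have
to preserve every register except the rep-stosb ones):

static __always_inline void zero_bytes(void *dst, unsigned long len)
{
	asm volatile(ALTERNATIVE("call __memset_fallback",	/* hypothetical */
				 "rep stosb", X86_FEATURE_ERMS)
		     : "+D" (dst), "+c" (len), ASM_CALL_CONSTRAINT
		     : "a" (0)
		     : "memory");
}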

Because the cores that do this right *do* exist, despite your hatred
of the rep string instructions. At least Borislav claims that the
modern AMD cores do better with 'rep stosb'.

In particular, see what we do for 'clear_user()', where we effectively
can do the above (because unlike memset, we control it entirely). See
commit 0db7058e8e23 ("x86/clear_user: Make it faster").

Once we'd have that kind of infrastructure, we could then control
exactly what 'memset()' does.

And I note that we should probably have added Borislav to the cc when
memset came up, exactly because he's been looking at it anyway. Even
if AMD seems to have slightly different optimization rules than Intel
cores probably do. But again, that only emphasizes the whole "we
should not have a static choice here".

                 Linus
Yury Norov March 4, 2023, 8:51 p.m. UTC | #44
On Sat, Mar 04, 2023 at 11:19:54AM -0800, Linus Torvalds wrote:
> On Fri, Mar 3, 2023 at 9:51 PM Yury Norov <yury.norov@gmail.com> wrote:
> >
> > And the following code will be broken:
> >
> > cpumask_t m1, m2;
> >
> > cpumask_setall(m1); // m1 is ffff ffff ffff ffff because it uses
> >                     // compile-time optimized nr_cpumask_bits
> >
> > for_each_cpu(cpu, m1) // 32 iterations because it relied on nr_cpu_ids
> >         cpumask_set_cpu(cpu, m2); // m2 is ffff ffff XXXX XXXX
> 
> So  honestly, it looks like you picked an example of something very
> unusual to then make everything else slower.

What about bootable sticks?
 
> Rather than commit aa47a7c215e7, we should just have fixed 'setall()'
> and 'for_each_cpu()' to use nr_cpu_ids, and then the rest would
> continue to use nr_cpumask_bits.

No, that wouldn't work for all.

> That particular code sequence is arguably broken to begin with.
> setall() should really only be used as a mask, most definitely not as
> some kind of "all possible cpus".

Sorry, don't understand this.

> The latter is "cpu_possible_mask", which is very different indeed (and
> often what you want is "cpu_online_mask")

Don't understand this possible vs online argument, but OK. What about this?

        if (cpumask_setall_is_fixed)
                cpumask_setall(mask);
        else {
                for_each_online_cpu(cpu)
                        cpumask_set_cpu(cpu, mask);
             }

        // You forgot to 'fix' cpumask_equal()
        BUG_ON(!cpumask_equal(cpu_online_mask, mask));

Or this:

        cpumask_clear(mask);
        for_each_cpu_not(cpu, mask)
                cpumask_set_cpu(cpu, mask);
        BUG_ON(!cpumask_full(mask));

The root of the problem is that half of the cpumask API relies (relied)
on the compile-time nr_cpumask_bits, and the other half on the boot-time
nr_cpu_ids.

So, if you consistently propagate your 'fix', you'll get rid of
nr_cpumask_bits entirely, with all the associated overhead.

That's what I actually did. And even tried to minimize the
overhead the best way I can think of.

> But I'd certainly be ok with using nr_cpu_ids for setall, partly
> exactly because it's so rare. It would probably be better to remove it
> entirely, but whatever.
> 
>               Linus
Linus Torvalds March 4, 2023, 9:01 p.m. UTC | #45
On Sat, Mar 4, 2023 at 12:51 PM Yury Norov <yury.norov@gmail.com> wrote:
>
> > That particular code sequence is arguably broken to begin with.
> > setall() should really only be used as a mask, most definitely not as
> > some kind of "all possible cpus".
>
> Sorry, don't understand this.

See the example patch I sent out.

Literally just make the rule be "we play games with cpumasks in that
they have two different 'sizes', so just make sure the bits in the
bigger and faster size are always clear".

That simple rule just means that we can then use that bigger constant
size in all cases where "upper bits zero" just don't matter.

Which is basically all of them.
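
In sketch form (not the actual patch, just the rule):

/* clearing and scanning can use the compile-time constant width,
 * because bits past nr_cpu_ids are guaranteed to stay zero ... */
static inline void sketch_cpumask_clear(struct cpumask *dstp)
{
	bitmap_zero(cpumask_bits(dstp), NR_CPUS);
}

/* ... while anything that *sets* bits has to honor the exact limit */
static inline void sketch_cpumask_set_cpu(unsigned int cpu, struct cpumask *dstp)
{
	if (WARN_ON_ONCE(cpu >= nr_cpu_ids))
		return;
	set_bit(cpu, cpumask_bits(dstp));
}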

Your for_each_cpu_not() example is actually a great example: it should
damn well not exist at all. I hadn't even noticed how broken it was.
Exactly like the other broken case (that I *did* notice -
cpumask_complement), it has no actual valid users. It _literally_ only
exists as a pointless test-case.

So this is *literally* what I'm talking about: you are making up silly
cases that then act as "arguments" for making all the _real_ cases
slower.

Stop it.

Silly useless cases are just that - silly and useless. They should not
be arguments for the real cases then being optimized and simplified.

Updated patch to remove 'for_each_cpu_not()' attached.

It's still completely untested. Treat this very much as a "Let's make
the common cases faster, at least for !MAXSMP".

                   Linus
Linus Torvalds March 4, 2023, 9:03 p.m. UTC | #46
On Sat, Mar 4, 2023 at 1:01 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Silly useless cases are just that - silly and useless. They should not
> be arguments for the real cases then being optimized and simplified.

There's a missing "not" above, that was hopefully obvious from the
context: "They should not be arguments for the real cases then NOT
being optimized and simplified"

               Linus
Linus Torvalds March 4, 2023, 9:10 p.m. UTC | #47
On Sat, Mar 4, 2023 at 1:01 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> It's still completely untested. Treat this very much as a "Let's make
> the common cases faster, at least for !MAXSMP".

Ok, so I started "testing" it in the sense that I actually looked at
the code it generated, and went all "why didn't it make any
difference".

And that's because the patch had the

  #ifdef CONFIG_CPUMASK_OFFSTACK

condition exactly the wrong way around.

So if somebody actually wants to play with that patch, you need to
change that to be

  #ifndef CONFIG_CPUMASK_OFFSTACK

(and then you obviously need to have a kernel config that does *not*
have MAXSMP set).

That at least simplifies some of the code generation when I look at
it. Whether the end result _works_ or not, I still haven't checked.
That patch is still very much meant as a "look, something like this
should make our cpumask handling much more efficient"

              Linus
Linus Torvalds March 4, 2023, 11:08 p.m. UTC | #48
On Sat, Mar 4, 2023 at 1:10 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Whether the end result _works_ or not, I still haven't checked.

Well, this particular patch at least boots for me for my normal
config. Not that I've run any extensive tests, but I'm writing this
email while running this patch, so ..

           Linus
Linus Torvalds March 4, 2023, 11:52 p.m. UTC | #49
On Sat, Mar 4, 2023 at 3:08 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Well, this particular patch at least boots for me for my normal
> config. Not that I've run any extensive tests, but I'm writing this
> email while running this patch, so ..

Hmm. I enabled the KUNIT tests, and used an odd CONFIG_NR_CPUS to test
this a bit more.

So in my situation, I have 64 threads, and so nr_cpu_ids is 64, and
CONFIG_NR_CPUS is 150.

Then one cpumask KUNIT test fails with

     # test_cpumask_weight: EXPECTATION FAILED at lib/cpumask_kunit.c:70
                  Expected ((unsigned int)150) == cpumask_weight(&mask_all), but
                      ((unsigned int)150) == 150 (0x96)
                      cpumask_weight(&mask_all) == 64 (0x40)
              &mask_all contains CPUs 0-63

but I think that's actually a KUNIT test bug.

The KUNIT test there is

        KUNIT_EXPECT_EQ_MSG(test, nr_cpumask_bits,
cpumask_weight(&mask_all), MASK_MSG(&mask_all));

and it should *not* expect the cpumask weight to be nr_cpumask_bits,
it should expect it to be nr_cpu_ids.

That only matters now that nr_cpumask_bits isn't the same as nr_cpu_ids.

Anyway, I still think that patch of mine is fine, and I think this
test failure only ends up being about the test, not the patch.

            Linus
David Laight March 5, 2023, 5:23 p.m. UTC | #50
From: Linus Torvalds
> Sent: 04 March 2023 20:48
> 
> On Sat, Mar 4, 2023 at 12:31 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
> >
> > Good news: gcc provides a lot of control as to how it inlines string
> > ops, most notably:
> >        -mstringop-strategy=alg
> 
> Note that any static decision is always going to be crap somewhere.
> You can make it do the "optimal" thing for any particular machine, but
> I consider that to be just garbage.
> 
> What I would actually like to see is the compiler always generate an
> out-of-line call for the "big enough to not just do inline trivially"
> case, but do so with the "rep stosb/movsb" calling convention.

I think you also want it to differentiate between requests that
are known to be a whole number of words and ones that might
be byte sized.

For the kmalloc+memzero case you know you can zero a whole
number of words - so all the checks memset has to do for
byte length/alignment can be removed.

The same is true for memcpy() calls used for structure copies.
The compiler knows that aligned full-word copies can be done.
So it shouldn't be calling a function that has to redo the tests.
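
Something like this (hypothetical helper, not an existing API) is all
that case actually needs:

/* caller guarantees word alignment and a whole number of words, so
 * none of memset()'s byte/alignment fixup paths are needed */
static inline void memzero_words(unsigned long *p, size_t nwords)
{
	while (nwords--)
		*p++ = 0;
}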

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Linus Torvalds March 5, 2023, 6:17 p.m. UTC | #51
On Sun, Mar 5, 2023 at 1:26 AM Sedat Dilek <sedat.dilek@gmail.com> wrote:
>
> can you share your "normal config", please?

Well, I just have CONFIG_NR_CPUS set to 64.

That happens to be the number of actual cores (well, threads) I have
on my desktop, but it's what I use for my laptop too (that has 8
cores).

Basically, I consider CONFIG_NR_CPUS=64 to be the likely "sweet spot"
for code generation while still covering 99+% of all machines out there.

Now, MAXSMP is great for (a) coverage testing and for (b) being able
to handle pretty much *anything* out there, but it really was
originally meant for the SGI kind of hardware: not exactly
off-the-shelf.

So I use MAXSMP for compile testing (by virtue of "allmodconfig"), and
it's great for that. But unless you have more than several hundred
cpus in your machine, you should never use it.

There are a few main issues with MAXSMP:

 - the simple ("common") embedded cpu masks end up being big (ie any
data structure that has a "cpumask_t" in it will be huge, just because
the static size of 'struct cpumask' is 8192 bits, ie 1kB)

 - the fancy case of using a "cpumask_var_t" will use a pointer and a
dynamic allocation (which is then sized to be appropriate to the
*actual* number of CPU's, so that you don't have to allocate 8192 bits
for everything).

 - the code generation ends up inevitably being about variable-sized
loops, because nobody wants to traverse those kinds of data structures

In contrast, if you use CONFIG_NR_CPUS=64, both the embedded and
"fancy" version will be just a single 64-bit word. No extra pointer
overhead, no indirection through said pointers, and no need for loops
(ok, there will still be loops for walking the bits in the word, but a
lot of them will actually be about using instructions like "bsf" etc).

So you end up with better code, smaller data structures, and less
pointer chasing.
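
The two shapes boil down to roughly this (simplified from the
cpumask_var_t definitions):

#ifdef CONFIG_CPUMASK_OFFSTACK
typedef struct cpumask *cpumask_var_t;		/* pointer + alloc_cpumask_var() */
#else
typedef struct cpumask cpumask_var_t[1];	/* plain array, no indirection */
#endif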

So those two situations are generally the two "sane" configurations: a
good reasonable NR_CPUS that works for just about everybody, and then
another extreme config for the 1% (or - more likely - 0.01%)

Now, it's not like 64 is somehow magical. Picking something like
NR_CPUS of 192 is perfectly fine too - it will use three words for the
bitmap, it will still avoid the pointer indirection, it will have a
few small fixed-size loops. It's certainly not *wrong*. It will cover
bigger HEDT machines, but I feel like the HEDT people probably are
special enough that they could probably just use the MAXSMP case, or -
if they care - just build their own kernels.

So you *can* certainly pick other values. We used to have special UP
vs SMP kernel builds, and that clearly no longer makes sense. Nobody
cares about UP on x86-64.

But I do feel like MAXSMP is simply not a great config for 99.9% of
all people, and if you are willing to have two configs, then that "64
or MAXSMP" seems to be the sane split.

And with that split, there will be *very* few people who actually use MAXSMP.

               Linus
Linus Torvalds March 5, 2023, 6:43 p.m. UTC | #52
On Sun, Mar 5, 2023 at 10:17 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> There are a few main issues with MAXSMP:

It's probably worth noting that most architectures don't even support
MAXSMP at all.

Only x86-64 does.

For example, ia64 and sparc64, which both did technically support a
lot of cores, just made "cpumask_t" huge, and had no support for the
whole "use a pointer to an indirect allocation".

That ends up meaning that you allocate those huge structures on the
stack or just make other structures enormous when they contain a CPU
mask, but it mostly works. It's a horrid, horrid model, though. But at
least ia64 had 64kB stacks anyway, and in the book of "bad engineering
decisions of Itanium", this is all just a footnote.

arm64 also has that "range 2 4096" for number of CPUs but defaults to
a much saner 256 cpus.

I suspect (and sincerely hope) that nobody actually tries to use an
arm64 build with that 4k cpu config. If/when arm64 actually does get up
to that 'thousands of cores' situation, they'll hopefully enable the
MAXSMP kind of indirection and off-stack cpu mask arrays.

So MAXSMP and the whole CPUMASK_OFFSTACK option is an architecture
choice, and you don't have to do it the way x86-64 does it. But the
x86 choice is likely the best tested and thought out by far.

For example, POWERPC technically supports CPUMASK_OFFSTACK too, but
really only in theory. On powerpc, you have

    config NR_CPUS
          range 2 8192 if SMP
          default "32" if PPC64

so while the configurable range is technically up to 8k CPUs, I doubt
people use that value very much. And we have

        select CPUMASK_OFFSTACK if NR_CPUS >= 8192

so it only uses that OFFSTACK one if you pick exactly 8192 CPUs (which
presumably nobody does in real life outside of build testing - it's
not the default, and I think most of the POWER range tops out in the
192 core range, eg E980 with 16 sockets of 12 cores each).

So I suspect that x86-64 is the *only* one to actually use this
widely, and I think distros have been *much* too eager to do so.

The fact that most distros default to

    CONFIG_MAXSMP=y
    CONFIG_NR_CPUS=8192

seems pretty crazy, when I have a hard time finding anything with more
than 192 cores.  I'm sure they exist. But do they _really_ run
unmodified vendor kernels?

               Linus
Yury Norov March 6, 2023, 5:43 a.m. UTC | #53
On Sat, Mar 04, 2023 at 03:08:49PM -0800, Linus Torvalds wrote:
> On Sat, Mar 4, 2023 at 1:10 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Whether the end result _works_ or not, I still haven't checked.
> 
> Well, this particular patch at least boots for me for my normal
> config. Not that I've run any extensive tests, but I'm writing this
> email while running this patch, so ..
> 
>            Linus

I didn't test it properly, but the approach looks good. Need some time
to think on the implications of the new rule. At first glance, there
should be no major impact on the cpumask machinery.

It should be very well tested on arm and m68k because they implement
their own bitmap functions.

Please see comments inline.

Thanks,
Yury

[...]

> diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
> index 10c92bd9b807..bd9576e8d856 100644
> --- a/include/linux/cpumask.h
> +++ b/include/linux/cpumask.h
> @@ -50,8 +50,30 @@ static inline void set_nr_cpu_ids(unsigned int nr)
>  #endif
>  }
>  
> -/* Deprecated. Always use nr_cpu_ids. */
> -#define nr_cpumask_bits	nr_cpu_ids
> +/*
> + * The difference between nr_cpumask_bits and nr_cpu_ids is that
> + * 'nr_cpu_ids' is the actual number of CPU ids in the system, while
> + * nr_cpumask_bits is a "reasonable upper value" that is often more
> + * efficient because it can be a fixed constant.
> + *
> + * So when clearing or traversing a cpumask, use 'nr_cpumask_bits',
> + * but when checking exact limits (and when _setting_ bits), use the
> + * tighter exact limit of 'nr_cpu_ids'.
> + *
> + * NOTE! The code depends on any exyta bits in nr_cpumask_bits a always

s/exyta/extra ?
s/a always/as always ?

> + * being (a) allocated and (b) zero, so that the only effect of using
> + * 'nr_cpumask_bits' is that we might return a higher maximum CPU value
> + * (which is why we have that pattern of
> + *
> + *   Returns >= nr_cpu_ids if no cpus set.
> + *
> + * for many of the functions - they can return that higher value).
> + */
> +#ifndef CONFIG_CPUMASK_OFFSTACK
> + #define nr_cpumask_bits ((unsigned int)NR_CPUS)
> +#else
> + #define nr_cpumask_bits	nr_cpu_ids
> +#endif
>  
>  /*
>   * The following particular system cpumasks and operations manage
> @@ -114,7 +136,7 @@ static __always_inline void cpu_max_bits_warn(unsigned int cpu, unsigned int bit
>  /* verify cpu argument to cpumask_* operators */
>  static __always_inline unsigned int cpumask_check(unsigned int cpu)
>  {
> -	cpu_max_bits_warn(cpu, nr_cpumask_bits);
> +	cpu_max_bits_warn(cpu, nr_cpu_ids);
>  	return cpu;
>  }
>  
> @@ -248,16 +270,6 @@ unsigned int cpumask_next_and(int n, const struct cpumask *src1p,
>  #define for_each_cpu(cpu, mask)				\
>  	for_each_set_bit(cpu, cpumask_bits(mask), nr_cpumask_bits)
>  
> -/**
> - * for_each_cpu_not - iterate over every cpu in a complemented mask
> - * @cpu: the (optionally unsigned) integer iterator
> - * @mask: the cpumask pointer
> - *
> - * After the loop, cpu is >= nr_cpu_ids.
> - */
> -#define for_each_cpu_not(cpu, mask)				\
> -	for_each_clear_bit(cpu, cpumask_bits(mask), nr_cpumask_bits)
> -

We can do it like:

    for ((bit) = 0;
         (bit) = find_next_zero_bit((addr), nr_cpumask_bits, (bit)),
         (bit) < nr_cpu_ids;
         (bit)++)
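
That is, for_each_cpu_not() would become something along these lines
(untested sketch, just to show the idea):

    #define for_each_cpu_not(cpu, mask)				\
    	for ((cpu) = 0;						\
    	     (cpu) = find_next_zero_bit(cpumask_bits(mask),	\
    					nr_cpumask_bits, (cpu)),\
    	     (cpu) < nr_cpu_ids;				\
    	     (cpu)++)

so the bitmap walk keeps using the compile-time nr_cpumask_bits limit,
while the loop still terminates on the exact nr_cpu_ids bound.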

>  #if NR_CPUS == 1
>  static inline
>  unsigned int cpumask_next_wrap(int n, const struct cpumask *mask, int start, bool wrap)
> @@ -495,10 +507,14 @@ static __always_inline bool cpumask_test_and_clear_cpu(int cpu, struct cpumask *
>  /**
>   * cpumask_setall - set all cpus (< nr_cpu_ids) in a cpumask
>   * @dstp: the cpumask pointer
> + *
> + * Note: since we set bits, we should use the tighter 'bitmap_set()' with
> + * the eact number of bits, not 'bitmap_fill()' that will fill past the

s/eact/exact

> + * end.
>   */
>  static inline void cpumask_setall(struct cpumask *dstp)
>  {
> -	bitmap_fill(cpumask_bits(dstp), nr_cpumask_bits);
> +	bitmap_set(cpumask_bits(dstp), 0, nr_cpu_ids);
>  }

It should be like:

 +	bitmap_set(cpumask_bits(dstp), 0, nr_cpu_ids);
 +	bitmap_clear(cpumask_bits(dstp), nr_cpu_ids, nr_cpumask_bits);

Because bitmap_set() will not zero memory beyond round_up(nr_cpu_ids, 64).
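
As a concrete (made-up) illustration: with NR_CPUS=512 built in and
nr_cpu_ids=100 detected at boot, cpumask_setall() would then need

    bitmap_set(cpumask_bits(dstp), 0, 100);            /* sets bits 0..99 only */
    bitmap_clear(cpumask_bits(dstp), 100, 512 - 100);  /* zero the stale 100..511 tail */

since bitmap_set() only ORs in the requested range and leaves whatever
was previously in the tail untouched, while the new rule relies on
everything past nr_cpu_ids staying zero.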
diff mbox series

Patch

diff --git a/fs/open.c b/fs/open.c
index 82c1a28b3308..2afed058250c 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -367,7 +367,37 @@  COMPAT_SYSCALL_DEFINE6(fallocate, int, fd, int, mode, compat_arg_u64_dual(offset
  * access() needs to use the real uid/gid, not the effective uid/gid.
  * We do this by temporarily clearing all FS-related capabilities and
  * switching the fsuid/fsgid around to the real ones.
+ *
+ * Creating new credentials is expensive, so we try to skip doing it,
+ * which we can if the result would match what we already got.
  */
+static bool access_need_override_creds(int flags)
+{
+	const struct cred *cred;
+
+	if (flags & AT_EACCESS)
+		return false;
+
+	cred = current_cred();
+	if (!uid_eq(cred->fsuid, cred->uid) ||
+	    !gid_eq(cred->fsgid, cred->gid))
+		return true;
+
+	if (!issecure(SECURE_NO_SETUID_FIXUP)) {
+		kuid_t root_uid = make_kuid(cred->user_ns, 0);
+		if (!uid_eq(cred->uid, root_uid)) {
+			if (!cap_isclear(cred->cap_effective))
+				return true;
+		} else {
+			if (!cap_isidentical(cred->cap_effective,
+			    cred->cap_permitted))
+				return true;
+		}
+	}
+
+	return false;
+}
+
 static const struct cred *access_override_creds(void)
 {
 	const struct cred *old_cred;
@@ -377,6 +407,12 @@  static const struct cred *access_override_creds(void)
 	if (!override_cred)
 		return NULL;
 
+	/*
+	 * XXX access_need_override_creds performs checks in hopes of skipping
+	 * this work. Make sure it stays in sync if making any changes in this
+	 * routine.
+	 */
+
 	override_cred->fsuid = override_cred->uid;
 	override_cred->fsgid = override_cred->gid;
 
@@ -436,7 +472,7 @@  static long do_faccessat(int dfd, const char __user *filename, int mode, int fla
 	if (flags & AT_EMPTY_PATH)
 		lookup_flags |= LOOKUP_EMPTY;
 
-	if (!(flags & AT_EACCESS)) {
+	if (access_need_override_creds(flags)) {
 		old_cred = access_override_creds();
 		if (!old_cred)
 			return -ENOMEM;