diff mbox series

vfs: relax linkat() AT_EMPTY_PATH - aka flink() - requirements

Message ID 20240411001012.12513-1-torvalds@linux-foundation.org (mailing list archive)
State New
Headers show
Series vfs: relax linkat() AT_EMPTY_PATH - aka flink() - requirements | expand

Commit Message

Linus Torvalds April 11, 2024, 12:10 a.m. UTC
"The definition of insanity is doing the same thing over and over
    again and expecting different results”

We've tried to do this before, most recently with commit bb2314b47996
("fs: Allow unprivileged linkat(..., AT_EMPTY_PATH) aka flink") about a
decade ago.

But the effort goes back even further than that, eg this thread back
from 1998 that is so old that we don't even have it archived in lore:

    https://lkml.org/lkml/1998/3/10/108

which also points out some of the reasons why it's dangerous.

Or, how about then in 2003:

    https://lkml.org/lkml/2003/4/6/112

where we went through some of the same arguments, just wirh different
people involved.

In particular, having access to a file descriptor does not necessarily
mean that you have access to the path that was used for lookup, and
there may be very good reasons why you absolutely must not have access
to a path to said file.

For example, if we were passed a file descriptor from the outside into
some limited environment (think chroot, but also user namespaces etc) a
'flink()' system call could now make that file visible inside a context
where it's not supposed to be visible.

In the process the user may also be able to re-open it with permissions
that the original file descriptor did not have (eg a read-only file
descriptor may be associated with an underlying file that is writable).

Another variation on this is if somebody else (typically root) opens a
file in a directory that is not accessible to others, and passes the
file descriptor on as a read-only file.  Again, the access to the file
descriptor does not imply that you should have access to a path to the
file in the filesystem.

So while we have tried this several times in the past, it never works.

The last time we did this, that commit bb2314b47996 quickly got reverted
again in commit f0cc6ffb8ce8 (Revert "fs: Allow unprivileged linkat(...,
AT_EMPTY_PATH) aka flink"), with a note saying "We may re-do this once
the whole discussion about the interface is done".

Well, the discussion is long done, and didn't come to any resolution.
There's no question that 'flink()' would be a useful operation, but it's
a dangerous one.

However, it does turn out that since 2008 (commit d76b0d9b2d87: "CRED:
Use creds in file structs") we have had a fairly straightforward way to
check whether the file descriptor was opened by the same credentials as
the credentials of the flink().

That allows the most common patterns that people want to use, which tend
to be to either open the source carefully (ie using the openat2()
RESOLVE_xyz flags, and/or checking ownership with fstat() before
linking), or to use O_TMPFILE and fill in the file contents before it's
exposed to the world with linkat().

But it also means that if the file descriptor was opened by somebody
else, or we've gone through a credentials change since, the operation no
longer works (unless we have CAP_DAC_READ_SEARCH capabilities, as
before).

Note that the credential equality check is done by using pointer
equality, which means that it's not enough that you have effectively the
same user - they have to be literally identical, since our credentials
are using copy-on-write semantics.

So you can't change your credentials to something else and try to change
it back to the same ones between the open() and the linkat().  This is
not meant to be some kind of generic permission check, this is literally
meant as a "the open and link calls are 'atomic' wrt user credentials"
check.

It also means that you can't just move things between namespaces,
because the credentials aren't just a list of uid's and gid's: they
includes the pointer to the user_ns that the capabilities are relative
to.

So let's try this one more time and see if maybe this approach ends up
being workable after all.

Cc: Andrew Lutomirski <luto@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Peter Anvin <hpa@zytor.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
 fs/namei.c            | 17 ++++++++++++-----
 include/linux/namei.h |  1 +
 2 files changed, 13 insertions(+), 5 deletions(-)

Comments

Linus Torvalds April 11, 2024, 12:20 a.m. UTC | #1
On Wed, 10 Apr 2024 at 17:10, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
>    "The definition of insanity is doing the same thing over and over
>     again and expecting different results”

Note that I'm sending this patch out not because I plan to commit it,
but to see if people can shoot holes in the concept.

There's a reason why people have tried to do this for decades.

There's also a reason why it has not worked out well.

             Linus
Linus Torvalds April 11, 2024, 2:39 a.m. UTC | #2
On Wed, 10 Apr 2024 at 17:10, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> +               if (flags & LOOKUP_DFD_MATCH_CREDS) {
> +                       if (f.file->f_cred != current_cred() &&
> +                           !capable(CAP_DAC_READ_SEARCH)) {
> +                               fdput(f);
> +                               return ERR_PTR(-ENOENT);
> +                       }
> +               }

Side note: I suspect that this could possibly be relaxed further, by
making the rule be that if something has been explicitly opened to be
used as a path (ie O_PATH was used at open time), we can link to it
even across different credentials.

IOW, the above could perhaps even be

+               if (flags & LOOKUP_DFD_MATCH_CREDS) {
+                       if (!(f.file->f_mode & FMODE_PATH) &&
+                           f.file->f_cred != current_cred() &&
+                           !capable(CAP_DAC_READ_SEARCH)) {
+                               fdput(f);
+                               return ERR_PTR(-ENOENT);
+                       }
+               }

which would _allow_ people to pass in paths as file descriptors if
they actually wanted to.

After all, the only thing you can do with an O_PATH file descriptor is
to use it as a path - there would be no other reason to use O_PATH in
the first place. So if you now pass it to somebody else, clearly you
are intentionally trying to make it available *as* a path.

So you could imagine doing something like this:

         // Open path as root
         int fd = open('filename", O_PATH);

        // drop privileges
        // setresuid(..) or chmod() or enter new namespace or whatever

        linkat(fd, "", AT_FDCWD, "newname", AT_EMPTY_PATH);

and it would open the path with one set of privileges, but then
intentionally go into a more restricted mode and create a link to the
source within that restricted environment.

Sensible? Who knows. I'm just throwing this out as another "this may
be the solution to our historical flink() issues".

           Linus
Christian Brauner April 11, 2024, 9:04 a.m. UTC | #3
On Wed, Apr 10, 2024 at 07:39:49PM -0700, Linus Torvalds wrote:
> On Wed, 10 Apr 2024 at 17:10, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > +               if (flags & LOOKUP_DFD_MATCH_CREDS) {
> > +                       if (f.file->f_cred != current_cred() &&
> > +                           !capable(CAP_DAC_READ_SEARCH)) {
> > +                               fdput(f);
> > +                               return ERR_PTR(-ENOENT);
> > +                       }
> > +               }
> 
> Side note: I suspect that this could possibly be relaxed further, by
> making the rule be that if something has been explicitly opened to be
> used as a path (ie O_PATH was used at open time), we can link to it
> even across different credentials.

I had a similar discussion a while back someone requested that we relax
permissions so linkat can be used in containers. And I drafted the
following patch back then:

https://lore.kernel.org/all/20231113-undenkbar-gediegen-efde5f1c34bc@brauner

IOW, I had intended to make this work with containers so that we check
CAP_DAC_READ_SEARCH in the namespace of the opener of the file. My
thinking had been that this can serve as a way to say "Hey, I could've
opened this file in the openers namespace therefore let me make a path
to it.". I didn't actually send it because I thought the original author
would but imho, that would be a worthwhile addition to your patch if
this makes sense...

> 
> IOW, the above could perhaps even be
> 
> +               if (flags & LOOKUP_DFD_MATCH_CREDS) {
> +                       if (!(f.file->f_mode & FMODE_PATH) &&
> +                           f.file->f_cred != current_cred() &&
> +                           !capable(CAP_DAC_READ_SEARCH)) {
> +                               fdput(f);
> +                               return ERR_PTR(-ENOENT);
> +                       }
> +               }
> 
> which would _allow_ people to pass in paths as file descriptors if
> they actually wanted to.
> 
> After all, the only thing you can do with an O_PATH file descriptor is
> to use it as a path - there would be no other reason to use O_PATH in
> the first place. So if you now pass it to somebody else, clearly you
> are intentionally trying to make it available *as* a path.
> 
> So you could imagine doing something like this:
> 
>          // Open path as root
>          int fd = open('filename", O_PATH);
> 
>         // drop privileges
>         // setresuid(..) or chmod() or enter new namespace or whatever
> 
>         linkat(fd, "", AT_FDCWD, "newname", AT_EMPTY_PATH);
> 
> and it would open the path with one set of privileges, but then
> intentionally go into a more restricted mode and create a link to the
> source within that restricted environment.
> 
> Sensible? Who knows. I'm just throwing this out as another "this may
> be the solution to our historical flink() issues".
> 
>            Linus
Christian Brauner April 11, 2024, 12:25 p.m. UTC | #4
On Thu, Apr 11, 2024 at 11:04:59AM +0200, Christian Brauner wrote:
> On Wed, Apr 10, 2024 at 07:39:49PM -0700, Linus Torvalds wrote:
> > On Wed, 10 Apr 2024 at 17:10, Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > > +               if (flags & LOOKUP_DFD_MATCH_CREDS) {
> > > +                       if (f.file->f_cred != current_cred() &&
> > > +                           !capable(CAP_DAC_READ_SEARCH)) {
> > > +                               fdput(f);
> > > +                               return ERR_PTR(-ENOENT);
> > > +                       }
> > > +               }
> > 
> > Side note: I suspect that this could possibly be relaxed further, by
> > making the rule be that if something has been explicitly opened to be
> > used as a path (ie O_PATH was used at open time), we can link to it
> > even across different credentials.
> 
> I had a similar discussion a while back someone requested that we relax
> permissions so linkat can be used in containers. And I drafted the
> following patch back then:
> 
> https://lore.kernel.org/all/20231113-undenkbar-gediegen-efde5f1c34bc@brauner
> 
> IOW, I had intended to make this work with containers so that we check
> CAP_DAC_READ_SEARCH in the namespace of the opener of the file. My
> thinking had been that this can serve as a way to say "Hey, I could've
> opened this file in the openers namespace therefore let me make a path
> to it.". I didn't actually send it because I thought the original author
> would but imho, that would be a worthwhile addition to your patch if
> this makes sense...

For example, say someone opened an O_PATH fd in the initial user ns and
then send that file over an AF_UNIX socket to some other container the
ns_capable(f_cred->user_ns, CAP_DAC_READ_SEARCH) would always be false.
The other way around though would work. Which imho is exactly what we
want to make such cross-container interactions with linkat() safe.

And this didn't aim to solve the problem of allowing unprivileged users
in the initial namespace to do linkat(), of course which yours does.

Btw, I think we should try to avoid putting this into path_init() and
confine this to linkat() itself imho. The way I tried to do it was by
presetting a root for filename_lookup(); means we also don't need a
LOOKUP_* flag for this as this is mostly a linkat thing.

So maybe your suggestion combined with my own attempt would make this
work for unprivileged users and containers?

if (f.file->f_cred != current_cred() &&
    !ns_capable(f.file->f_cred->user_ns, CAP_DAC_READ_SEARCH))

Worst case we get a repeat of the revert and get to make this a 10 year
anniversary patch attempt?
Linus Torvalds April 11, 2024, 4:15 p.m. UTC | #5
On Thu, 11 Apr 2024 at 02:05, Christian Brauner <brauner@kernel.org> wrote:
>
> I had a similar discussion a while back someone requested that we relax
> permissions so linkat can be used in containers.

Hmm.

Ok, that's different - it just wants root to be able to do it, but
"root" being just in the container itself.

I don't think that's all that useful - I think one of the issues with
linkat(AT_EMPTY_PATH) is exactly that "it's only useful for root",
which means that it's effectively useless. Inside a container or out.

Because very few loads run as root-only (and fewer still run with any
capability bits that aren't just "root or nothing").

Before I did all this, I did a Debian code search for linkat with
AT_EMPTY_PATH, and it's almost non-existent. And I think it's exactly
because of this "when it's only useful for root, it's hardly useful at
all" issue.

(Of course, my Debian code search may have been broken).

So I suspect your special case is actually largely useless, and what
the container user actually wanted was what my patch does, but they
didn't think that was possible, so they asked to just extend the
"root" notion.

I've added Charles to the Cc.

But yes, with my patch, it would now be trivial to make that

        capable(CAP_DAC_READ_SEARCH)

test also be

        ns_capable(f.file->f_cred->user_ns, CAP_DAC_READ_SEARCH)

instead. I suspect not very many would care any more, but it does seem
conceptually sensible.

As to your patch - I don't like your nd->root  games in that patch at
all. That looks odd.

Yes, it makes lookup ignore the dfd (so you avoid the TOCTOU issue),
but it also makes lookup ignore "/". Which happens to be ok with an
empty path, but still...

So it feels to me like that patch of yours mis-uses something that is
just meant for vfs_path_lookup().

It may happen to work, but it smells really odd to me.

             Linus
Linus Torvalds April 11, 2024, 4:21 p.m. UTC | #6
On Thu, 11 Apr 2024 at 05:25, Christian Brauner <brauner@kernel.org> wrote:
>
> Btw, I think we should try to avoid putting this into path_init() and
> confine this to linkat() itself imho. The way I tried to do it was by
> presetting a root for filename_lookup(); means we also don't need a
> LOOKUP_* flag for this as this is mostly a linkat thing.

So I had the exact reverse reaction to your patch - I felt that using
that 'root' thing was the hacky case.

The lookup flag may be limited to linkat(), but it makes the code
smaller and clearer, and avoids having multiple places where we check
dfd.

And that 'root' argument really is the special hacky case, and is not
actually used by any normal system call path, and is meant for
internal kernel use rather than any generic case.

           Linus
Charles Mirabile April 11, 2024, 4:44 p.m. UTC | #7
On Thu, Apr 11, 2024 at 12:15 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, 11 Apr 2024 at 02:05, Christian Brauner <brauner@kernel.org> wrote:
> >
> > I had a similar discussion a while back someone requested that we relax
> > permissions so linkat can be used in containers.
>
> Hmm.
>
> Ok, that's different - it just wants root to be able to do it, but
> "root" being just in the container itself.
>
> I don't think that's all that useful - I think one of the issues with
> linkat(AT_EMPTY_PATH) is exactly that "it's only useful for root",
> which means that it's effectively useless. Inside a container or out.
>
> Because very few loads run as root-only (and fewer still run with any
> capability bits that aren't just "root or nothing").
>
> Before I did all this, I did a Debian code search for linkat with
> AT_EMPTY_PATH, and it's almost non-existent. And I think it's exactly
> because of this "when it's only useful for root, it's hardly useful at
> all" issue.
>
> (Of course, my Debian code search may have been broken).
>
> So I suspect your special case is actually largely useless, and what
> the container user actually wanted was what my patch does, but they
> didn't think that was possible, so they asked to just extend the
> "root" notion.
>
Yes, that is absolutely the case. When Christian poked holes in my
initial submission, I started working on a v2 but haven't had a chance
to send it because I wanted to make sure my arguments etc were
airtight because I am well aware of just how long and storied the
history of flink is. In the v2 that I didn't post yet, I extended the
ability to link *any* file from only true root to also "root" within a
container following the potentially suspect approach that christian
suggested (I see where you are coming from with the unobvious approach
to avoiding toctou though I obviously support this patch being
merged), but I added a followup patch that checks for whether the file
in question is an `__O_TMPFILE` file which is trivial once we are
jumping through the hoops of looking up the struct file. If it is a
tmpfile (i.e. linkcount = 0) I think that all the concerns raised
about processes that inherit a fd being able to create links to the
file inappropriately are moot. Here is an excerpt from the cover
letter I was planning to send that explains my reasoning.

 -  the ability to create an unnamed file, adjust its contents
and attributes (i.e. uid, gid, mode, xattrs, etc) and then only give it a name
once they are ready is exactly the ideal use-case for a hypothetical `flink`
system call. The ability to use `AT_EMPTY_PATH` with `linkat` essentially turns
it into `flink`, but these restrictions hamper it from actually being used, as
most code is not privileged. By checking whether the file to be linked is a
temporary (i.e. unnamed) file we can allow this useful case without allowing
the more risky cases that could have security implications.

 - Although it might appear that the security posture is affected, it is not.
Callers who open an extant file on disk in read only mode for sharing with
potentially untrustworthy code can trust that this change will not affect them
because it only applies to temporary files. Callers who open a temporary file
need to do so with write access if they want to actually put any data in it,
so they will already have to reopen the file (e.g. by linking it to a path
and opening that path, or opening the magic symlink in proc) if they want to
share it in read-only mode with untrustworthy code. As such, that new file
description will no longer be marked as a temporary file and thus these changes
do not apply. The final case I can think of is the unlikely possibility of
intentionally sharing read write access to data stored in a temporary file with
untrustworthy code, but where that code being able to make a link would still
be deleterious. In such a bizarre case, the `O_EXCL` can and should be used to
fully prevent the temporary file from ever having a name, and this change does
not alter its efficacy.

> I've added Charles to the Cc.
>
> But yes, with my patch, it would now be trivial to make that
>
>         capable(CAP_DAC_READ_SEARCH)
>
> test also be
>
>         ns_capable(f.file->f_cred->user_ns, CAP_DAC_READ_SEARCH)
>
> instead. I suspect not very many would care any more, but it does seem
> conceptually sensible.
>
> As to your patch - I don't like your nd->root  games in that patch at
> all. That looks odd.
>
> Yes, it makes lookup ignore the dfd (so you avoid the TOCTOU issue),
> but it also makes lookup ignore "/". Which happens to be ok with an
> empty path, but still...
>
> So it feels to me like that patch of yours mis-uses something that is
> just meant for vfs_path_lookup().
>
> It may happen to work, but it smells really odd to me.
>
>              Linus
>
Charles Mirabile April 11, 2024, 5:29 p.m. UTC | #8
Here is an updated version of linus's patch that makes the check
namespace relative
---
 fs/namei.c            | 17 ++++++++++++-----
 include/linux/namei.h |  1 +
 2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 4e0de939fea1..2bcc10976ba7 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2419,6 +2419,14 @@ static const char *path_init(struct nameidata
*nd, unsigned flags)
         if (!f.file)
             return ERR_PTR(-EBADF);

+        if (flags & LOOKUP_DFD_MATCH_CREDS) {
+            if (f.file->f_cred != current_cred() &&
+                !ns_capable(f.file->f_cred->user_ns, CAP_DAC_READ_SEARCH)) {
+                fdput(f);
+                return ERR_PTR(-ENOENT);
+            }
+        }
+
         dentry = f.file->f_path.dentry;

         if (*s && unlikely(!d_can_lookup(dentry))) {
@@ -4640,14 +4648,13 @@ int do_linkat(int olddfd, struct filename
*old, int newdfd,
         goto out_putnames;
     }
     /*
-     * To use null names we require CAP_DAC_READ_SEARCH
+     * To use null names we require CAP_DAC_READ_SEARCH or
+     * that the open-time creds of the dfd matches current.
      * This ensures that not everyone will be able to create
      * handlink using the passed filedescriptor.
      */
-    if (flags & AT_EMPTY_PATH && !capable(CAP_DAC_READ_SEARCH)) {
-        error = -ENOENT;
-        goto out_putnames;
-    }
+    if (flags & AT_EMPTY_PATH)
+        how |= LOOKUP_DFD_MATCH_CREDS;

     if (flags & AT_SYMLINK_FOLLOW)
         how |= LOOKUP_FOLLOW;
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 74e0cc14ebf8..678ffe4acf99 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -44,6 +44,7 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT};
 #define LOOKUP_BENEATH        0x080000 /* No escaping from starting point. */
 #define LOOKUP_IN_ROOT        0x100000 /* Treat dirfd as fs root. */
 #define LOOKUP_CACHED        0x200000 /* Only do cached lookup */
+#define LOOKUP_DFD_MATCH_CREDS    0x400000 /* Require that dfd creds
match current */
 /* LOOKUP_* flags which do scope-related checks based on the dirfd. */
 #define LOOKUP_IS_SCOPED (LOOKUP_BENEATH | LOOKUP_IN_ROOT)
Charles Mirabile April 11, 2024, 5:35 p.m. UTC | #9
And a slightly dubious addition to bypass these checks for tmpfiles
across the board.

I don't like putting the details of this in the path lookup code, but
it is definitely nicer here than
looking up the fd twice, once here and again in do_linkat

These changes not strictly necessary, because the existing patch
allows unprivileged
tmpfile flink as long as the creds don't change, but I do think that
my reasoning is sound
that linking a tmpfile should always be ok whether or not creds have changed

---
 fs/namei.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/namei.c b/fs/namei.c
index 2bcc10976ba7..5c9fabc39481 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2421,6 +2421,7 @@ static const char *path_init(struct nameidata
*nd, unsigned flags)

                if (flags & LOOKUP_DFD_MATCH_CREDS) {
                        if (f.file->f_cred != current_cred() &&
+                           !(f.file->f_flags & __O_TMPFILE) &&
                            !ns_capable(f.file->f_cred->user_ns,
CAP_DAC_READ_SEARCH)) {
                                fdput(f);
                                return ERR_PTR(-ENOENT);

On Thu, Apr 11, 2024 at 1:29 PM Charles Mirabile <cmirabil@redhat.com> wrote:
>
> Here is an updated version of linus's patch that makes the check
> namespace relative
> ---
>  fs/namei.c            | 17 ++++++++++++-----
>  include/linux/namei.h |  1 +
>  2 files changed, 13 insertions(+), 5 deletions(-)
>
> diff --git a/fs/namei.c b/fs/namei.c
> index 4e0de939fea1..2bcc10976ba7 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -2419,6 +2419,14 @@ static const char *path_init(struct nameidata
> *nd, unsigned flags)
>          if (!f.file)
>              return ERR_PTR(-EBADF);
>
> +        if (flags & LOOKUP_DFD_MATCH_CREDS) {
> +            if (f.file->f_cred != current_cred() &&
> +                !ns_capable(f.file->f_cred->user_ns, CAP_DAC_READ_SEARCH)) {
> +                fdput(f);
> +                return ERR_PTR(-ENOENT);
> +            }
> +        }
> +
>          dentry = f.file->f_path.dentry;
>
>          if (*s && unlikely(!d_can_lookup(dentry))) {
> @@ -4640,14 +4648,13 @@ int do_linkat(int olddfd, struct filename
> *old, int newdfd,
>          goto out_putnames;
>      }
>      /*
> -     * To use null names we require CAP_DAC_READ_SEARCH
> +     * To use null names we require CAP_DAC_READ_SEARCH or
> +     * that the open-time creds of the dfd matches current.
>       * This ensures that not everyone will be able to create
>       * handlink using the passed filedescriptor.
>       */
> -    if (flags & AT_EMPTY_PATH && !capable(CAP_DAC_READ_SEARCH)) {
> -        error = -ENOENT;
> -        goto out_putnames;
> -    }
> +    if (flags & AT_EMPTY_PATH)
> +        how |= LOOKUP_DFD_MATCH_CREDS;
>
>      if (flags & AT_SYMLINK_FOLLOW)
>          how |= LOOKUP_FOLLOW;
> diff --git a/include/linux/namei.h b/include/linux/namei.h
> index 74e0cc14ebf8..678ffe4acf99 100644
> --- a/include/linux/namei.h
> +++ b/include/linux/namei.h
> @@ -44,6 +44,7 @@ enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT};
>  #define LOOKUP_BENEATH        0x080000 /* No escaping from starting point. */
>  #define LOOKUP_IN_ROOT        0x100000 /* Treat dirfd as fs root. */
>  #define LOOKUP_CACHED        0x200000 /* Only do cached lookup */
> +#define LOOKUP_DFD_MATCH_CREDS    0x400000 /* Require that dfd creds
> match current */
>  /* LOOKUP_* flags which do scope-related checks based on the dirfd. */
>  #define LOOKUP_IS_SCOPED (LOOKUP_BENEATH | LOOKUP_IN_ROOT)
>
> --
> 2.44.0
>
> On Thu, Apr 11, 2024 at 12:44 PM Charles Mirabile <cmirabil@redhat.com> wrote:
> >
> > On Thu, Apr 11, 2024 at 12:15 PM Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > > On Thu, 11 Apr 2024 at 02:05, Christian Brauner <brauner@kernel.org> wrote:
> > > >
> > > > I had a similar discussion a while back someone requested that we relax
> > > > permissions so linkat can be used in containers.
> > >
> > > Hmm.
> > >
> > > Ok, that's different - it just wants root to be able to do it, but
> > > "root" being just in the container itself.
> > >
> > > I don't think that's all that useful - I think one of the issues with
> > > linkat(AT_EMPTY_PATH) is exactly that "it's only useful for root",
> > > which means that it's effectively useless. Inside a container or out.
> > >
> > > Because very few loads run as root-only (and fewer still run with any
> > > capability bits that aren't just "root or nothing").
> > >
> > > Before I did all this, I did a Debian code search for linkat with
> > > AT_EMPTY_PATH, and it's almost non-existent. And I think it's exactly
> > > because of this "when it's only useful for root, it's hardly useful at
> > > all" issue.
> > >
> > > (Of course, my Debian code search may have been broken).
> > >
> > > So I suspect your special case is actually largely useless, and what
> > > the container user actually wanted was what my patch does, but they
> > > didn't think that was possible, so they asked to just extend the
> > > "root" notion.
> > >
> > Yes, that is absolutely the case. When Christian poked holes in my
> > initial submission, I started working on a v2 but haven't had a chance
> > to send it because I wanted to make sure my arguments etc were
> > airtight because I am well aware of just how long and storied the
> > history of flink is. In the v2 that I didn't post yet, I extended the
> > ability to link *any* file from only true root to also "root" within a
> > container following the potentially suspect approach that christian
> > suggested (I see where you are coming from with the unobvious approach
> > to avoiding toctou though I obviously support this patch being
> > merged), but I added a followup patch that checks for whether the file
> > in question is an `__O_TMPFILE` file which is trivial once we are
> > jumping through the hoops of looking up the struct file. If it is a
> > tmpfile (i.e. linkcount = 0) I think that all the concerns raised
> > about processes that inherit a fd being able to create links to the
> > file inappropriately are moot. Here is an excerpt from the cover
> > letter I was planning to send that explains my reasoning.
> >
> >  -  the ability to create an unnamed file, adjust its contents
> > and attributes (i.e. uid, gid, mode, xattrs, etc) and then only give it a name
> > once they are ready is exactly the ideal use-case for a hypothetical `flink`
> > system call. The ability to use `AT_EMPTY_PATH` with `linkat` essentially turns
> > it into `flink`, but these restrictions hamper it from actually being used, as
> > most code is not privileged. By checking whether the file to be linked is a
> > temporary (i.e. unnamed) file we can allow this useful case without allowing
> > the more risky cases that could have security implications.
> >
> >  - Although it might appear that the security posture is affected, it is not.
> > Callers who open an extant file on disk in read only mode for sharing with
> > potentially untrustworthy code can trust that this change will not affect them
> > because it only applies to temporary files. Callers who open a temporary file
> > need to do so with write access if they want to actually put any data in it,
> > so they will already have to reopen the file (e.g. by linking it to a path
> > and opening that path, or opening the magic symlink in proc) if they want to
> > share it in read-only mode with untrustworthy code. As such, that new file
> > description will no longer be marked as a temporary file and thus these changes
> > do not apply. The final case I can think of is the unlikely possibility of
> > intentionally sharing read write access to data stored in a temporary file with
> > untrustworthy code, but where that code being able to make a link would still
> > be deleterious. In such a bizarre case, the `O_EXCL` can and should be used to
> > fully prevent the temporary file from ever having a name, and this change does
> > not alter its efficacy.
> >
> > > I've added Charles to the Cc.
> > >
> > > But yes, with my patch, it would now be trivial to make that
> > >
> > >         capable(CAP_DAC_READ_SEARCH)
> > >
> > > test also be
> > >
> > >         ns_capable(f.file->f_cred->user_ns, CAP_DAC_READ_SEARCH)
> > >
> > > instead. I suspect not very many would care any more, but it does seem
> > > conceptually sensible.
> > >
> > > As to your patch - I don't like your nd->root  games in that patch at
> > > all. That looks odd.
> > >
> > > Yes, it makes lookup ignore the dfd (so you avoid the TOCTOU issue),
> > > but it also makes lookup ignore "/". Which happens to be ok with an
> > > empty path, but still...
> > >
> > > So it feels to me like that patch of yours mis-uses something that is
> > > just meant for vfs_path_lookup().
> > >
> > > It may happen to work, but it smells really odd to me.
> > >
> > >              Linus
> > >
Linus Torvalds April 11, 2024, 6:13 p.m. UTC | #10
On Thu, 11 Apr 2024 at 10:35, Charles Mirabile <cmirabil@redhat.com> wrote:
>
> And a slightly dubious addition to bypass these checks for tmpfiles
> across the board.

Does this make sense?

I 100% agree that one of the primary reasons why people want flink()
is that "open tmpfile, finalize contents and permissions, then link
the final result into the filesystem".

But I would expect that the "same credentials as open" check is the
one that really matters.

And __O_TMPFILE is just a special case that might not even be used -
it's entirely possible to just do the same with a real file (ie
non-O_TMPFILE) and link it in place and remove the original.

Not to mention that ->tmpfile() isn't necessarily even available, so
the whole concept of "use O_TMPFILE and then linkat" is actually
broken. It *has* to be able to fall back to a regular file to work at
all on NFS.

So while I understand your motivation, I actually think it's actively
wrong to special-case __O_TMPFILE, because it encourages a pattern
that is bad.

                    Linus
Linus Torvalds April 11, 2024, 7:34 p.m. UTC | #11
On Thu, 11 Apr 2024 at 11:13, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So while I understand your motivation, I actually think it's actively
> wrong to special-case __O_TMPFILE, because it encourages a pattern
> that is bad.

Just to clarify: I think the ns_capable() change is a good idea and
makes sense. The whole "limited to global root" makes no sense if the
file was opened within a namespace, and I think it always just came
from the better check not being obvious at the point where
AT_EMPTY_PATH was checked for.

Similarly, while the FMODE_PATH test _looks_ very similar to an
O_TMPFILE check, I think it's fundamentally different in a conceptual
sense: not only is FMODE_PATH filesystem-agnostic, a FMODE_PATH file
is *only* useful as a pathname (ie no read/write semantics).

And so if a FMODE_PATH file descriptor is passed in from the outside,
I feel like the "you cannot use this to create a path" is kind of a
fundamentally nonsensical rule.

IOW, whoever is passing that FMODE_PATH file descriptor around must
have actually thought about it, and must have opened it with O_PATH,
and it isn't useful for anything else than as a starting point for a
path lookup.

So while I don't think the __O_TMPFILE exception would necessarily be
wrong per se, I am afraid that it would result in people writing
convenient code that "appears to work" in testing, but then fails when
run in an environment where the directory is mounted over NFS (or any
other filesystem that doesn't do ->tmpfile()).

I am certainly open to be convinced otherwise, but I really think that
the real pattern to aim for should just be "look, I opened the file
myself, then filled in the detail, and now I'm doing a linkat() to
expose it" and that the real protection issue should be that "my
credentials are the same for open and linkat".

The other rules (ie the capability check or the FMODE_PATH case) would
be literally about situations where you *want* to pass things around
between protection domains.

In that context, the ns_capable() and the FMODE_PATH check make sense to me.

In contrast, the __O_TMPFILE check just feels like a random detail.

Hmm?

Anyway, end result of that is that this is what that part of the patch
looks like for me right now:

+               if (flags & LOOKUP_DFD_MATCH_CREDS) {
+                       const struct cred *cred = f.file->f_cred;
+                       if (!(f.file->f_mode & FMODE_PATH) &&
+                           cred != current_cred() &&
+                           !ns_capable(cred->user_ns, CAP_DAC_READ_SEARCH)) {
+                               fdput(f);
+                               return ERR_PTR(-ENOENT);
+                       }
+               }

and that _seems_ sensible to me.

But yes, this all has been something that we have failed to do right
for at least a quarter of a century so far, so this needs a *lot* of
thought, even if the patch itself is rather small and looks relatively
obvious.

                 Linus
Charles Mirabile April 11, 2024, 8:08 p.m. UTC | #12
On Thu, Apr 11, 2024 at 2:14 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, 11 Apr 2024 at 10:35, Charles Mirabile <cmirabil@redhat.com> wrote:
> >
> > And a slightly dubious addition to bypass these checks for tmpfiles
> > across the board.
>
> Does this make sense?
>
> I 100% agree that one of the primary reasons why people want flink()
> is that "open tmpfile, finalize contents and permissions, then link
> the final result into the filesystem".
>
> But I would expect that the "same credentials as open" check is the
> one that really matters.

Certainly. I think that in almost all cases, the pattern of preparing
a file and then using linkat to give it its final name will occur
without changing creds and as such your patch will fix it. When
combined with making the CAP_DAC_READ_SEARCH override aware of
namespaces, I think that covers almost all of the remaining edges
cases (i.e. if the difference in creds was actually you becoming more
privileged and not less, then sure  you can do it).

The only possibility that remains is a difference in creds where the
new creds are not privileged. This is the maybe scary situation that
has blocked flink in the past. All I am suggesting is that you can
decompose this niche situation still further into the case where the
file was opened "ordinarily" in some sense which is the case that is
really concerning (oops, when I forked and dropped privileges before
exec, I forgot to set cloexec on one of my fds that points to
something important, and even though I opened it with O_RDONLY, the
unprivileged code is able to make a new link to it in a directory
they control and re-open with O_RDWR because the mode bits allow say
o+w (extremely bizarre and honestly hypothetical in my opinion, but
ok)) and other special situations.

Those situations namely being O_PATH and O_TMPFILE where by
specifying these special flags during open you are indicating that
you intend to do something special with the file descriptor. I think
if either of these flags are present in the file flags, then we are
not in the concerning case, and I think it could be appropriate to
bypass the matching creds check.

>
> And __O_TMPFILE is just a special case that might not even be used -
> it's entirely possible to just do the same with a real file (ie
> non-O_TMPFILE) and link it in place and remove the original.
>
> Not to mention that ->tmpfile() isn't necessarily even available, so
> the whole concept of "use O_TMPFILE and then linkat" is actually
> broken. It *has* to be able to fall back to a regular file to work at
> all on NFS.
>
> So while I understand your motivation, I actually think it's actively
> wrong to special-case __O_TMPFILE, because it encourages a pattern
> that is bad.

The problem with this is that another process might be able to access
the file during via that name during the brief period before it is
unlinked. If I am not using NFS, I am always going to prefer using
O_TMPFILE. I would rather be able to do that without restriction even
if it isn't the most robust solution by your definition.

In my opinion I think it is more robust in the sense that it is truly
atomic and making a named file is the kludge that you have to do to
work around NFS limitations, but I agree that this is a tiny detail
that I certainly do not want to block this patch over because it
already solves the problem I was actually dealing with. Whether or
not it solves this hypothetical problem is less important.

>
>                     Linus
>
Linus Torvalds April 11, 2024, 8:22 p.m. UTC | #13
On Thu, 11 Apr 2024 at 13:08, Charles Mirabile <cmirabil@redhat.com> wrote:
>
> The problem with this is that another process might be able to access
> the file during via that name during the brief period before it is
> unlinked. If I am not using NFS, I am always going to prefer using
> O_TMPFILE. I would rather be able to do that without restriction even
> if it isn't the most robust solution by your definition.


Oh, absolutely. I think the right pattern is basically some variation of

    fd = open(filename, O_TMPFILE | O_WRONLY, 0600);
    if (fd < 0) {
        char template{...] = ".tmpfileXXXXXX";
        fd = mkstmp(template);
        unlink(template);
    }
    .. now act on fd to initialize it ..
    linkat(fd, "", AT_FDCWD, "finalname", AT_EMPTY_PATH);

which should work reasonably well in various environments.

Clearly O_TMPFILE is the superior option when it exists. I'm just
saying that anything that *relies* on it existing is dubious.

              Linus
Christian Brauner April 12, 2024, 6:41 a.m. UTC | #14
On Thu, Apr 11, 2024 at 12:44:46PM -0400, Charles Mirabile wrote:
> On Thu, Apr 11, 2024 at 12:15 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Thu, 11 Apr 2024 at 02:05, Christian Brauner <brauner@kernel.org> wrote:
> > >
> > > I had a similar discussion a while back someone requested that we relax
> > > permissions so linkat can be used in containers.
> >
> > Hmm.
> >
> > Ok, that's different - it just wants root to be able to do it, but
> > "root" being just in the container itself.
> >
> > I don't think that's all that useful - I think one of the issues with
> > linkat(AT_EMPTY_PATH) is exactly that "it's only useful for root",
> > which means that it's effectively useless. Inside a container or out.
> >
> > Because very few loads run as root-only (and fewer still run with any
> > capability bits that aren't just "root or nothing").
> >
> > Before I did all this, I did a Debian code search for linkat with
> > AT_EMPTY_PATH, and it's almost non-existent. And I think it's exactly
> > because of this "when it's only useful for root, it's hardly useful at
> > all" issue.
> >
> > (Of course, my Debian code search may have been broken).
> >
> > So I suspect your special case is actually largely useless, and what
> > the container user actually wanted was what my patch does, but they
> > didn't think that was possible, so they asked to just extend the
> > "root" notion.
> >
> Yes, that is absolutely the case. When Christian poked holes in my
> initial submission, I started working on a v2 but haven't had a chance
> to send it because I wanted to make sure my arguments etc were
> airtight because I am well aware of just how long and storied the
> history of flink is. In the v2 that I didn't post yet, I extended the
> ability to link *any* file from only true root to also "root" within a
> container following the potentially suspect approach that christian
> suggested (I see where you are coming from with the unobvious approach

I'm sorry, what is suspect about my approach? Your patch in [1] lowered
the capability checks to ns_capable(current_user_ns(), CAP_DAC_READ_SEARCH).
That's very much wrong because it means root in an unprivileged
container could linkat() any file descriptor including any opened by
real root. The permission needs to be tied to the opener of the file.
Otherwise that's guaranteed to break security assumptions. Whereas my
patch avoids that.

[1]: https://lore.kernel.org/all/20231110170615.2168372-2-cmirabil@redhat.com
Christian Brauner April 12, 2024, 6:44 a.m. UTC | #15
On Thu, Apr 11, 2024 at 11:13:53AM -0700, Linus Torvalds wrote:
> On Thu, 11 Apr 2024 at 10:35, Charles Mirabile <cmirabil@redhat.com> wrote:
> >
> > And a slightly dubious addition to bypass these checks for tmpfiles
> > across the board.
> 
> Does this make sense?
> 
> I 100% agree that one of the primary reasons why people want flink()
> is that "open tmpfile, finalize contents and permissions, then link
> the final result into the filesystem".
> 
> But I would expect that the "same credentials as open" check is the
> one that really matters.

Yes. There's no need to give O_TMPFILE special status there. We also end
up with a collection of special-cases which is just unpleasant.
Christian Brauner April 12, 2024, 7:45 a.m. UTC | #16
On Thu, Apr 11, 2024 at 12:34:52PM -0700, Linus Torvalds wrote:
> On Thu, 11 Apr 2024 at 11:13, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > So while I understand your motivation, I actually think it's actively
> > wrong to special-case __O_TMPFILE, because it encourages a pattern
> > that is bad.
> 
> Just to clarify: I think the ns_capable() change is a good idea and
> makes sense. The whole "limited to global root" makes no sense if the
> file was opened within a namespace, and I think it always just came
> from the better check not being obvious at the point where
> AT_EMPTY_PATH was checked for.
> 
> Similarly, while the FMODE_PATH test _looks_ very similar to an
> O_TMPFILE check, I think it's fundamentally different in a conceptual
> sense: not only is FMODE_PATH filesystem-agnostic, a FMODE_PATH file
> is *only* useful as a pathname (ie no read/write semantics).

But the annoying part imho is that we also allow stuff like fsetxattr()
to work on O_PATH file descriptors. Something that I really dislike.

> 
> And so if a FMODE_PATH file descriptor is passed in from the outside,
> I feel like the "you cannot use this to create a path" is kind of a
> fundamentally nonsensical rule.

Hm, I would like to avoid adding an exception for O_PATH. While it is
useful for lookup O_PATH is also often used simply as an opaque handle
to a file. O_PATH is often used to simply identify or compare files, or
for execveat(). Especially for cross-container interactions it would be
nice to not generally allow O_PATH to be linkat()ed. The current
behavior proposed in this patch should be enough.
Christian Brauner April 12, 2024, 8:56 a.m. UTC | #17
On Thu, Apr 11, 2024 at 09:21:27AM -0700, Linus Torvalds wrote:
> On Thu, 11 Apr 2024 at 05:25, Christian Brauner <brauner@kernel.org> wrote:
> >
> > Btw, I think we should try to avoid putting this into path_init() and
> > confine this to linkat() itself imho. The way I tried to do it was by
> > presetting a root for filename_lookup(); means we also don't need a
> > LOOKUP_* flag for this as this is mostly a linkat thing.
> 
> So I had the exact reverse reaction to your patch - I felt that using

Fun. I really disliked the idea of having to somehow wrangle this into
lookup itself and that the root argument just happened to not be used
for non-kernel internal use-cases yet. But I'm not married to this so
fine by me.
Christian Brauner April 12, 2024, 9:07 a.m. UTC | #18
On Wed, 10 Apr 2024 17:10:12 -0700, Linus Torvalds wrote:
>    "The definition of insanity is doing the same thing over and over
>     again and expecting different results”
> 
> We've tried to do this before, most recently with commit bb2314b47996
> ("fs: Allow unprivileged linkat(..., AT_EMPTY_PATH) aka flink") about a
> decade ago.
> 
> [...]

So it seems that this might be worth trying. I've picked up the patch
with two modifications:

(1) added the relaxed capability check.
(2) renamed the flag to LOOKUP_LINKAT_EMPTY
(3) slight adjustment to commit message

Should show up in -next if I don't hear objections or you want to apply
this directly. Fingers crossed we don't see regressions.

---

Applied to the vfs.misc branch of the vfs/vfs.git tree.
Patches in the vfs.misc branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.misc

[1/1] vfs: relax linkat() AT_EMPTY_PATH - aka flink() - requirements
      https://git.kernel.org/vfs/vfs/c/fa75d6e377fd
Linus Torvalds April 12, 2024, 3:36 p.m. UTC | #19
On Fri, 12 Apr 2024 at 00:46, Christian Brauner <brauner@kernel.org> wrote:
>
> Hm, I would like to avoid adding an exception for O_PATH.

Ack. It's not the important or really relevant part.

             Linus
Linus Torvalds April 12, 2024, 5:43 p.m. UTC | #20
Side note: I'd really like to relax another unrelated AT_EMPTY_PATH
issue: we should just allow a NULL path for that case.

The requirement that you pass an actual empty string is insane. It's
wrong. And it adds a noticeable amount of expense to this path,
because just getting the single byte and looking it up is fairly
expensive.

This was more noticeable because glibc at one point (still?) did

        newfstatat(6, "", buf, AT_EMPTY_PATH)

when it should have just done a simple "fstat()".

So there were (are?) a *LOT* of AT_EMPTY_PATH users, and they all do a
pointless "let's copy a string from user space".

And yes, I know exactly why AT_EMPTY_PATH exists: because POSIX
traditionally says that a path of "" has to return -ENOENT, not the
current working directory. So AT_EMPTY_PATH basically says "allow the
empty path for lookup".

But while it *allows* the empty path, it does't *force* it, so it
doesn't mean "avoid the lookup", and we really end up doing a lot of
extra work just for this case. Just the user string copy is a big deal
because of the whole overhead of accessing user space, but it's also
the whole "allocate memory for the path etc".

If we either said "a NULL path with AT_EMPTY_PATH means empty", or
even just added a new AT_NULL_PATH thing that means "path has to be
NULL, and it means the same as AT_EMPTY_PATH with an empty path", we'd
be able to avoid quite a bit of pointless work.

                  Linus
Christian Brauner April 13, 2024, 9:41 a.m. UTC | #21
On Fri, Apr 12, 2024 at 10:43:06AM -0700, Linus Torvalds wrote:
> Side note: I'd really like to relax another unrelated AT_EMPTY_PATH
> issue: we should just allow a NULL path for that case.
> 
> The requirement that you pass an actual empty string is insane. It's
> wrong. And it adds a noticeable amount of expense to this path,
> because just getting the single byte and looking it up is fairly
> expensive.
> 
> This was more noticeable because glibc at one point (still?) did
> 
>         newfstatat(6, "", buf, AT_EMPTY_PATH)
> 
> when it should have just done a simple "fstat()".
> 
> So there were (are?) a *LOT* of AT_EMPTY_PATH users, and they all do a
> pointless "let's copy a string from user space".
> 
> And yes, I know exactly why AT_EMPTY_PATH exists: because POSIX
> traditionally says that a path of "" has to return -ENOENT, not the
> current working directory. So AT_EMPTY_PATH basically says "allow the
> empty path for lookup".
> 
> But while it *allows* the empty path, it does't *force* it, so it
> doesn't mean "avoid the lookup", and we really end up doing a lot of
> extra work just for this case. Just the user string copy is a big deal
> because of the whole overhead of accessing user space, but it's also
> the whole "allocate memory for the path etc".
> 
> If we either said "a NULL path with AT_EMPTY_PATH means empty", or
> even just added a new AT_NULL_PATH thing that means "path has to be
> NULL, and it means the same as AT_EMPTY_PATH with an empty path", we'd
> be able to avoid quite a bit of pointless work.

It also causes issues for sandboxed enviroments (most recently for the
Chrome sandbox) because AT_EMPTY_PATH doesn't actually mean
AT_EMPTY_PATH unless the string is actually empty. Otherwise
AT_EMPTY_PATH is ignored. So I'm all on board for this. I need to think
a bit whether AT_NULL_PATH or just allowing NULL would be nicer. Mostly
because I want to ensure that userspace can easily detect this new
feature.
Christian Brauner April 13, 2024, 3:16 p.m. UTC | #22
On Sat, Apr 13, 2024 at 11:41:57AM +0200, Christian Brauner wrote:
> On Fri, Apr 12, 2024 at 10:43:06AM -0700, Linus Torvalds wrote:
> > Side note: I'd really like to relax another unrelated AT_EMPTY_PATH
> > issue: we should just allow a NULL path for that case.
> > 
> > The requirement that you pass an actual empty string is insane. It's
> > wrong. And it adds a noticeable amount of expense to this path,
> > because just getting the single byte and looking it up is fairly
> > expensive.
> > 
> > This was more noticeable because glibc at one point (still?) did
> > 
> >         newfstatat(6, "", buf, AT_EMPTY_PATH)
> > 
> > when it should have just done a simple "fstat()".
> > 
> > So there were (are?) a *LOT* of AT_EMPTY_PATH users, and they all do a
> > pointless "let's copy a string from user space".
> > 
> > And yes, I know exactly why AT_EMPTY_PATH exists: because POSIX
> > traditionally says that a path of "" has to return -ENOENT, not the
> > current working directory. So AT_EMPTY_PATH basically says "allow the
> > empty path for lookup".
> > 
> > But while it *allows* the empty path, it does't *force* it, so it
> > doesn't mean "avoid the lookup", and we really end up doing a lot of
> > extra work just for this case. Just the user string copy is a big deal
> > because of the whole overhead of accessing user space, but it's also
> > the whole "allocate memory for the path etc".
> > 
> > If we either said "a NULL path with AT_EMPTY_PATH means empty", or
> > even just added a new AT_NULL_PATH thing that means "path has to be
> > NULL, and it means the same as AT_EMPTY_PATH with an empty path", we'd
> > be able to avoid quite a bit of pointless work.
> 
> It also causes issues for sandboxed enviroments (most recently for the
> Chrome sandbox) because AT_EMPTY_PATH doesn't actually mean
> AT_EMPTY_PATH unless the string is actually empty. Otherwise
> AT_EMPTY_PATH is ignored. So I'm all on board for this. I need to think
> a bit whether AT_NULL_PATH or just allowing NULL would be nicer. Mostly
> because I want to ensure that userspace can easily detect this new
> feature.

I think it should be ok to allow AT_EMPTY_PATH with NULL because
userspace can detect whether the kernel allows that by passing
AT_EMPTY_PATH with a NULL path argument and they would get an error back
that would tell them that this kernel doesn't support NULL paths.

I'd like to try a patch for this next week. It's a good opportunity to
get into some of the more gritty details of this area.

From a rough first glance most AT_EMPTY_PATH users should be covered by
adapting getname_flags() accordingly.

Imho, this could likely be done by introducing a single struct filename
null_filename. That also takes care of audit that reuses the pathname.
That thing would basically never go away and the refcnt remain fixed at
one. Kind of similar to what we did for struct mnt_idmap nop_mnt_idmap.
That's at least what I naively hope for but I haven't yet starting
looking into all the dark corners.
Linus Torvalds April 13, 2024, 5:07 p.m. UTC | #23
On Sat, 13 Apr 2024 at 08:16, Christian Brauner <brauner@kernel.org> wrote:
>
> I think it should be ok to allow AT_EMPTY_PATH with NULL because
> userspace can detect whether the kernel allows that by passing
> AT_EMPTY_PATH with a NULL path argument and they would get an error back
> that would tell them that this kernel doesn't support NULL paths.

Yeah, it should return -1 / EFAULT on  older kernels.

> I'd like to try a patch for this next week. It's a good opportunity to
> get into some of the more gritty details of this area.
>
> From a rough first glance most AT_EMPTY_PATH users should be covered by
> adapting getname_flags() accordingly.
>
> Imho, this could likely be done by introducing a single struct filename
> null_filename.

It's probably better to try to special-case it entirely.

See commit 9013c51c630a ("vfs: mostly undo glibc turning 'fstat()'
into 'fstatat(AT_EMPTY_PATH)'") and the numbers in there in
particular.

That still leaves performance on the table exactly because it has to
do that extra "get_user()" to check for an empty path, but it avoids
not only the pathname allocation, but also the setup for the pathname
walk.

If we had a NULL case there, I'd expect that fstatat() and fstat()
would perform the same (modulo a couple of instructions).

Of course, the performance of get_user() will vary depending on
microarchitecture. If you don't have SMAP, it's cheap. It's the
STAC/CLAC that is most of the cost, and the exact cost of those will
then depend on implementations - they *could* be much faster than they
are.

              Linus
diff mbox series

Patch

diff --git a/fs/namei.c b/fs/namei.c
index c5b2a25be7d0..3c684014eb40 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2422,6 +2422,14 @@  static const char *path_init(struct nameidata *nd, unsigned flags)
 		if (!f.file)
 			return ERR_PTR(-EBADF);
 
+		if (flags & LOOKUP_DFD_MATCH_CREDS) {
+			if (f.file->f_cred != current_cred() &&
+			    !capable(CAP_DAC_READ_SEARCH)) {
+				fdput(f);
+				return ERR_PTR(-ENOENT);
+			}
+		}
+
 		dentry = f.file->f_path.dentry;
 
 		if (*s && unlikely(!d_can_lookup(dentry))) {
@@ -4641,14 +4649,13 @@  int do_linkat(int olddfd, struct filename *old, int newdfd,
 		goto out_putnames;
 	}
 	/*
-	 * To use null names we require CAP_DAC_READ_SEARCH
+	 * To use null names we require CAP_DAC_READ_SEARCH or
+	 * that the open-time creds of the dfd matches current.
 	 * This ensures that not everyone will be able to create
 	 * handlink using the passed filedescriptor.
 	 */
-	if (flags & AT_EMPTY_PATH && !capable(CAP_DAC_READ_SEARCH)) {
-		error = -ENOENT;
-		goto out_putnames;
-	}
+	if (flags & AT_EMPTY_PATH)
+		how |= LOOKUP_DFD_MATCH_CREDS;
 
 	if (flags & AT_SYMLINK_FOLLOW)
 		how |= LOOKUP_FOLLOW;
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 74e0cc14ebf8..678ffe4acf99 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -44,6 +44,7 @@  enum {LAST_NORM, LAST_ROOT, LAST_DOT, LAST_DOTDOT};
 #define LOOKUP_BENEATH		0x080000 /* No escaping from starting point. */
 #define LOOKUP_IN_ROOT		0x100000 /* Treat dirfd as fs root. */
 #define LOOKUP_CACHED		0x200000 /* Only do cached lookup */
+#define LOOKUP_DFD_MATCH_CREDS	0x400000 /* Require that dfd creds match current */
 /* LOOKUP_* flags which do scope-related checks based on the dirfd. */
 #define LOOKUP_IS_SCOPED (LOOKUP_BENEATH | LOOKUP_IN_ROOT)