diff mbox series

[RFC] fs: introduce mkdirat2 syscall for atomic mkdir

Message ID 20210228002500.11483-1-sir@cmpwn.com (mailing list archive)
State New, archived
Headers show
Series [RFC] fs: introduce mkdirat2 syscall for atomic mkdir | expand

Commit Message

Drew DeVault Feb. 28, 2021, 12:25 a.m. UTC
The mkdir and mkdirat syscalls both return 0 on success, and use of the
newly-created directory requires a separate open or openat (or openat2)
call. The time between these syscalls is an opportunity for a race
condition. It is thus desirable to establish a means of creating a
directory and returning an open dirfd for it in one atomic operation.

The possibility of using open(at(2)) with O_CREAT | O_DIRECTORY suggests
itself. The present behavior for this flag combination is quite
certainly wrong: it returns ENOTDIR, but creates a file with the desired
name. We could explicitly support this combination of flags, but we are
loathe to expand an already over-burdened syscall with additional flag
combinations.

This introduces mkdirat2, along with the requisite flag argument, which
presently accepts the same flags as open - allowing the caller to
specify, say, O_CLOEXEC - and leaving us room to expand the next time an
unforeseeable addition to mkdir is called for. Otherwise, it behaves
identically to mkdirat, but returns an open file descriptor for the new
directory.
---
This is my first foray into the fs bits, and first syscall addition as
well; the reader's patience with the many obvious errors this is certain
to include is much appreciated.

I am pretty sure that this is actually atomic - gated between
user_path_create and done_path_create - but I admit that I don't grok
the inode locking bits. Otherwise, this was basically made by splicing
together relevant-looking bits of mkdir and openat until a syscall which
appeared to work under testing emerged.

 arch/alpha/kernel/syscalls/syscall.tbl      |  1 +
 arch/arm/tools/syscall.tbl                  |  1 +
 arch/arm64/include/asm/unistd.h             |  2 +-
 arch/arm64/include/asm/unistd32.h           |  2 +
 arch/ia64/kernel/syscalls/syscall.tbl       |  1 +
 arch/m68k/kernel/syscalls/syscall.tbl       |  1 +
 arch/microblaze/kernel/syscalls/syscall.tbl |  1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |  1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   |  1 +
 arch/parisc/kernel/syscalls/syscall.tbl     |  1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    |  1 +
 arch/s390/kernel/syscalls/syscall.tbl       |  1 +
 arch/sh/kernel/syscalls/syscall.tbl         |  1 +
 arch/sparc/kernel/syscalls/syscall.tbl      |  1 +
 arch/x86/entry/syscalls/syscall_64.tbl      |  1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     |  1 +
 fs/namei.c                                  | 46 ++++++++++++++++++---
 include/linux/syscalls.h                    |  1 +
 include/uapi/asm-generic/unistd.h           |  4 +-
 19 files changed, 62 insertions(+), 7 deletions(-)

Comments

Al Viro Feb. 28, 2021, 2:13 a.m. UTC | #1
On Sat, Feb 27, 2021 at 07:25:00PM -0500, Drew DeVault wrote:

> This introduces mkdirat2, along with the requisite flag argument, which
> presently accepts the same flags as open - allowing the caller to
> specify, say, O_CLOEXEC - and leaving us room to expand the next time an
> unforeseeable addition to mkdir is called for. Otherwise, it behaves
> identically to mkdirat, but returns an open file descriptor for the new
> directory.

No to the ABI part; "on error it returns -E..., on success - 0 or
a non-negative number representing a file descriptor (zero also possible,
but unlikely)" is bloody awful as calling conventions go, especially
since the case when 0 happens to be a descriptor is not going to get
a lot of testing on the userland side.

Don't mix "return an error or descriptor" with "return an error or 0".
It's going to end up a regular source of userland bugs.
Drew DeVault Feb. 28, 2021, 2:21 a.m. UTC | #2
On Sat Feb 27, 2021 at 9:13 PM EST, Al Viro wrote:
> No to the ABI part; "on error it returns -E..., on success - 0 or a
> non-negative number representing a file descriptor (zero also
> possible, but unlikely)" is bloody awful as calling conventions go,
> especially since the case when 0 happens to be a descriptor is not
> going to get a lot of testing on the userland side.

Hm, I was just trying to mimic the behavior of open(2). Do you have a
better suggestion?
Matthew Wilcox Feb. 28, 2021, 2:24 a.m. UTC | #3
On Sat, Feb 27, 2021 at 07:25:00PM -0500, Drew DeVault wrote:
> The mkdir and mkdirat syscalls both return 0 on success, and use of the
> newly-created directory requires a separate open or openat (or openat2)
> call. The time between these syscalls is an opportunity for a race
> condition. It is thus desirable to establish a means of creating a
> directory and returning an open dirfd for it in one atomic operation.

I don't understand what the TOCTOU race is.

$ cd /tmp
$ mkdir foo
$ sudo su fake
$ rmdir foo
rmdir: failed to remove 'foo': Operation not permitted
$ mv foo bar
mv: cannot move 'foo' to 'bar': Operation not permitted

Where's the problem?  If mkdir succeeds in a sticky directory, others
can't remove or rename it.  So how can an app be tricked into doing
something wrong?
Drew DeVault Feb. 28, 2021, 2:26 a.m. UTC | #4
On Sat Feb 27, 2021 at 9:24 PM EST, Matthew Wilcox wrote:
> Where's the problem? If mkdir succeeds in a sticky directory, others
> can't remove or rename it. So how can an app be tricked into doing
> something wrong?

It's not a security concern, it's just about about making the software
more robust.

1. Program A creates a directory
2. Program A is pre-empted
3. Program B deletes the directory
4. Program A creates a file in that directory
5. RIP
Al Viro Feb. 28, 2021, 2:58 a.m. UTC | #5
On Sat, Feb 27, 2021 at 09:21:09PM -0500, Drew DeVault wrote:
> On Sat Feb 27, 2021 at 9:13 PM EST, Al Viro wrote:
> > No to the ABI part; "on error it returns -E..., on success - 0 or a
> > non-negative number representing a file descriptor (zero also
> > possible, but unlikely)" is bloody awful as calling conventions go,
> > especially since the case when 0 happens to be a descriptor is not
> > going to get a lot of testing on the userland side.
> 
> Hm, I was just trying to mimic the behavior of open(2). Do you have a
> better suggestion?

open() *always* returns descriptor or an error, for one thing.
And quite a few of open() flags are completely wrong for mkdir,
starting with symlink following and truncation.

What's more, your implementation is both racy and deadlock-prone -
it repeats the entire pathwalk with no warranty that it'll
arrive to the object you've created *AND* if you have
something like /foo/bar/baz/../../splat and dentry of bar
gets evicted on memory pressure, that pathwalk will end up
trying to look bar up.  In the already locked /foo, aka
/foo/bar/baz/../..

TBH, I don't understand what are you trying to achieve -
what will that mkdir+open combination buy you, especially
since that atomicity goes straight out of window if you try
to use that on e.g. NFS.  How is the userland supposed to make
use of that thing?
Matthew Wilcox Feb. 28, 2021, 4:03 a.m. UTC | #6
On Sat, Feb 27, 2021 at 09:26:21PM -0500, Drew DeVault wrote:
> On Sat Feb 27, 2021 at 9:24 PM EST, Matthew Wilcox wrote:
> > Where's the problem? If mkdir succeeds in a sticky directory, others
> > can't remove or rename it. So how can an app be tricked into doing
> > something wrong?
> 
> It's not a security concern, it's just about about making the software
> more robust.
> 
> 1. Program A creates a directory
> 2. Program A is pre-empted
> 3. Program B deletes the directory
> 4. Program A creates a file in that directory
> 5. RIP

umm ... program B deletes the directory.  program A opens it in order to
use openat().  program A gets ENOENT and exits, confused.  that's the
race you're removing here -- and it seems fairly insignificant to me.
Drew DeVault Feb. 28, 2021, 1:56 p.m. UTC | #7
On Sat Feb 27, 2021 at 9:58 PM EST, Al Viro wrote:
> open() *always* returns descriptor or an error, for one thing.
> And quite a few of open() flags are completely wrong for mkdir,
> starting with symlink following and truncation.

So does mkdirat2. Are you referring to the do_mkdirat2 function? I
merged mkdir/mkdirat/mkdirat2 into one function with a flag to enable
the mkdirat2 behavior, to avoid copying and pasting much of the
functionality. However, the syscalls themselves don't overload their
return value as you expect. mkdir & mkdirat both still return 0 or an
error, and mkdirat2 always returns an fd or an error. If you prefer, I
can leave their implementations separate so that this is more clear.

I supposed the flags might be wrong - should I just introduce a new set
of flags, with the specific ones which are useful (which I think is just
O_CLOEXEC)?

> What's more, your implementation is both racy and deadlock-prone -
> it repeats the entire pathwalk with no warranty that it'll
> arrive to the object you've created *AND* if you have
> something like /foo/bar/baz/../../splat and dentry of bar
> gets evicted on memory pressure, that pathwalk will end up
> trying to look bar up. In the already locked /foo, aka
> /foo/bar/baz/../..

This is down to unfamiliarity with this code, I think. I'll try to give
it a closer look.

> TBH, I don't understand what are you trying to achieve -
> what will that mkdir+open combination buy you, especially
> since that atomicity goes straight out of window if you try
> to use that on e.g. NFS. How is the userland supposed to make
> use of that thing?

I'm trying to close what appears to be an oversight in the API. See the
previous threads:

https://lore.kernel.org/linux-fsdevel/C9KKYZ4T5O53.338Y48UIQ9W3H@taiga/T/#t
https://lore.kernel.org/linux-fsdevel/20200316142057.xo24zea3k5zwswra@yavin/

Userland uses it the same way they use mkdir+open, but in one call, so
that they can use the directory they make as soon as it's created. The
atomicity goal, if possible, would also add a reference to the new
directory via the open fd, so they can use it even if it's removed by
another process. It makes such applications less error-prone, albiet in
a minor edge case.

I'm not sure what's involved with the NFS case, but I can look into it.
Drew DeVault Feb. 28, 2021, 1:57 p.m. UTC | #8
On Sat Feb 27, 2021 at 11:03 PM EST, Matthew Wilcox wrote:
> > 1. Program A creates a directory
> > 2. Program A is pre-empted
> > 3. Program B deletes the directory
> > 4. Program A creates a file in that directory
> > 5. RIP
>
> umm ... program B deletes the directory. program A opens it in order to
> use openat(). program A gets ENOENT and exits, confused. that's the
> race you're removing here -- and it seems fairly insignificant to me.

Yes, that is the race being eliminated here. Instead of this, program A
has an fd which holds a reference to the directory, so it just works. A
race is a race. It's an oversight in the API.
J. Bruce Fields March 1, 2021, 7:02 p.m. UTC | #9
On Sun, Feb 28, 2021 at 02:58:22AM +0000, Al Viro wrote:
> TBH, I don't understand what are you trying to achieve -
> what will that mkdir+open combination buy you, especially
> since that atomicity goes straight out of window if you try
> to use that on e.g. NFS.  How is the userland supposed to make
> use of that thing?

For what it's worth, the RPC that creates a directory can also get a
filehandle of the new directory, so I don't think there's anything in
the NFS protocol that would *prevent* implementing this.  (Whether
that's useful, I don't know.)

--b.
J. Bruce Fields March 1, 2021, 7:09 p.m. UTC | #10
On Sun, Feb 28, 2021 at 08:57:20AM -0500, Drew DeVault wrote:
> On Sat Feb 27, 2021 at 11:03 PM EST, Matthew Wilcox wrote:
> > > 1. Program A creates a directory
> > > 2. Program A is pre-empted
> > > 3. Program B deletes the directory
> > > 4. Program A creates a file in that directory
> > > 5. RIP
> >
> > umm ... program B deletes the directory. program A opens it in order to
> > use openat(). program A gets ENOENT and exits, confused. that's the
> > race you're removing here -- and it seems fairly insignificant to me.
> 
> Yes, that is the race being eliminated here. Instead of this, program A
> has an fd which holds a reference to the directory, so it just works. A
> race is a race. It's an oversight in the API.

Step 4 still fails either way, because you can't create a file in an
unlinked directory, even if you hold a reference to that directory.
What's the behavior change at step 4 that you're hoping for?

--b.
Matthew Wilcox March 1, 2021, 7:35 p.m. UTC | #11
On Mon, Mar 01, 2021 at 02:09:03PM -0500, J. Bruce Fields wrote:
> On Sun, Feb 28, 2021 at 08:57:20AM -0500, Drew DeVault wrote:
> > On Sat Feb 27, 2021 at 11:03 PM EST, Matthew Wilcox wrote:
> > > > 1. Program A creates a directory
> > > > 2. Program A is pre-empted
> > > > 3. Program B deletes the directory
> > > > 4. Program A creates a file in that directory
> > > > 5. RIP
> > >
> > > umm ... program B deletes the directory. program A opens it in order to
> > > use openat(). program A gets ENOENT and exits, confused. that's the
> > > race you're removing here -- and it seems fairly insignificant to me.
> > 
> > Yes, that is the race being eliminated here. Instead of this, program A
> > has an fd which holds a reference to the directory, so it just works. A
> > race is a race. It's an oversight in the API.
> 
> Step 4 still fails either way, because you can't create a file in an
> unlinked directory, even if you hold a reference to that directory.
> What's the behavior change at step 4 that you're hoping for?

If step 3 is 'mv foo bar', then the behaviour change will be that the
files still get created, just as bar/quux, instead of foo/quux.  It's not
clear to me this is necessarily an improvement in behaviour.

(as an aside, i think there's a missing feature in posix -- being able
to atomically replace one directory with another.  you can atomically
replace one file with another with hard links, but since you can't
hardlink a directory, you can't do the same trick.  Maybe you should
just always move files out of a directory instead of moving directories
as a single operation)
J. Bruce Fields March 1, 2021, 8:10 p.m. UTC | #12
On Mon, Mar 01, 2021 at 07:35:37PM +0000, Matthew Wilcox wrote:
> On Mon, Mar 01, 2021 at 02:09:03PM -0500, J. Bruce Fields wrote:
> > On Sun, Feb 28, 2021 at 08:57:20AM -0500, Drew DeVault wrote:
> > > On Sat Feb 27, 2021 at 11:03 PM EST, Matthew Wilcox wrote:
> > > > > 1. Program A creates a directory
> > > > > 2. Program A is pre-empted
> > > > > 3. Program B deletes the directory
> > > > > 4. Program A creates a file in that directory
> > > > > 5. RIP
> > > >
> > > > umm ... program B deletes the directory. program A opens it in order to
> > > > use openat(). program A gets ENOENT and exits, confused. that's the
> > > > race you're removing here -- and it seems fairly insignificant to me.
> > > 
> > > Yes, that is the race being eliminated here. Instead of this, program A
> > > has an fd which holds a reference to the directory, so it just works. A
> > > race is a race. It's an oversight in the API.
> > 
> > Step 4 still fails either way, because you can't create a file in an
> > unlinked directory, even if you hold a reference to that directory.
> > What's the behavior change at step 4 that you're hoping for?
> 
> If step 3 is 'mv foo bar', then the behaviour change will be that the
> files still get created, just as bar/quux, instead of foo/quux.  It's not
> clear to me this is necessarily an improvement in behaviour.

Oh, OK.

Yeah, it'd be useful to have some more detail on how this would be used.

--b.

> (as an aside, i think there's a missing feature in posix -- being able
> to atomically replace one directory with another.  you can atomically
> replace one file with another with hard links, but since you can't
> hardlink a directory, you can't do the same trick.  Maybe you should
> just always move files out of a directory instead of moving directories
> as a single operation)
Amir Goldstein March 2, 2021, 7:13 a.m. UTC | #13
On Sun, Feb 28, 2021 at 4:02 PM Drew DeVault <sir@cmpwn.com> wrote:
>
> On Sat Feb 27, 2021 at 11:03 PM EST, Matthew Wilcox wrote:
> > > 1. Program A creates a directory
> > > 2. Program A is pre-empted
> > > 3. Program B deletes the directory
> > > 4. Program A creates a file in that directory
> > > 5. RIP
> >
> > umm ... program B deletes the directory. program A opens it in order to
> > use openat(). program A gets ENOENT and exits, confused. that's the
> > race you're removing here -- and it seems fairly insignificant to me.
>
> Yes, that is the race being eliminated here. Instead of this, program A
> has an fd which holds a reference to the directory, so it just works. A
> race is a race. It's an oversight in the API.

I think you mixed in confusion with "program B deletes the directory".
That will result, as Matthew wrote in ENOENT because that dir is now
IS_DEADDIR().

I think I understand what you mean with the oversight in the API, but
the use case should involve mkdtemp(3) - make it more like tmpfile(3).
Not that *I* can think of the races this can solve, but I am pretty sure
that people with security background will be able to rationalize this.

You start your pitch by ruling out the option of openat2() with
O_CREAT | O_DIRECTORY, because you have strong emotions
against it (loathe).
I personally do not share this feeling with you, because:
1. The syscall is already used to open directories as well as files
2. The whole idea of openat2() is that you can add new behaviors
    with new open_how flags, so no existing app will be surprised from
    behavior change of  O_CREAT | O_DIRECTORY combination.

For your consideration.

Thanks,
Amir.
Miklos Szeredi March 2, 2021, 8:24 a.m. UTC | #14
On Mon, Mar 1, 2021 at 8:42 PM Matthew Wilcox <willy@infradead.org> wrote:

> (as an aside, i think there's a missing feature in posix -- being able
> to atomically replace one directory with another.

RENAME_EXCHANGE.

Thanks,
Miklos
Aleksa Sarai March 3, 2021, 2:39 a.m. UTC | #15
On 2021-03-02, Amir Goldstein <amir73il@gmail.com> wrote:
> On Sun, Feb 28, 2021 at 4:02 PM Drew DeVault <sir@cmpwn.com> wrote:
> >
> > On Sat Feb 27, 2021 at 11:03 PM EST, Matthew Wilcox wrote:
> > > > 1. Program A creates a directory
> > > > 2. Program A is pre-empted
> > > > 3. Program B deletes the directory
> > > > 4. Program A creates a file in that directory
> > > > 5. RIP
> > >
> > > umm ... program B deletes the directory. program A opens it in order to
> > > use openat(). program A gets ENOENT and exits, confused. that's the
> > > race you're removing here -- and it seems fairly insignificant to me.
> >
> > Yes, that is the race being eliminated here. Instead of this, program A
> > has an fd which holds a reference to the directory, so it just works. A
> > race is a race. It's an oversight in the API.
> 
> I think you mixed in confusion with "program B deletes the directory".
> That will result, as Matthew wrote in ENOENT because that dir is now
> IS_DEADDIR().
> 
> I think I understand what you mean with the oversight in the API, but
> the use case should involve mkdtemp(3) - make it more like tmpfile(3).
> Not that *I* can think of the races this can solve, but I am pretty sure
> that people with security background will be able to rationalize this.
> 
> You start your pitch by ruling out the option of openat2() with
> O_CREAT | O_DIRECTORY, because you have strong emotions
> against it (loathe).
> I personally do not share this feeling with you, because:
> 1. The syscall is already used to open directories as well as files

Al NACKed doing it as part of open[1]. My understanding is that the main
two reasons for that were:

 1. open() and mkdir() have different semantics for resolving paths. For
	instance, open(O_CREAT) will create a file at the target of dangling
	symlink but mkdir() will not allow that. I believe there's also some
	funky trailing-"/" handling with mkdirat() as well.

 2. Adding more multiplexers is bad. openat2(2)

I think (1) alone is a strong enough justification, since I don't think
there's precedent for having two different path lookup semantics in the
same VFS syscall.

> 2. The whole idea of openat2() is that you can add new behaviors
>     with new open_how flags, so no existing app will be surprised from
>     behavior change of  O_CREAT | O_DIRECTORY combination.

While it is true that you *could* do this with openat2(), the intention
of openat2() was to allow us to add new arguments openat() if those
arguments make sense within the context of an "open" operation.

(An example would be the opath_mask stuff which I included in the
original series, and am working on re-sending -- that is an additional
argument for O_PATH, and is still clearly linked to opening something.)

[1]: https://lore.kernel.org/linux-fsdevel/20200313182844.GO23230@ZenIV.linux.org.uk/
Stefan Metzmacher March 8, 2021, 1:50 p.m. UTC | #16
Am 01.03.21 um 20:02 schrieb J. Bruce Fields:
> On Sun, Feb 28, 2021 at 02:58:22AM +0000, Al Viro wrote:
>> TBH, I don't understand what are you trying to achieve -
>> what will that mkdir+open combination buy you, especially
>> since that atomicity goes straight out of window if you try
>> to use that on e.g. NFS.  How is the userland supposed to make
>> use of that thing?
> 
> For what it's worth, the RPC that creates a directory can also get a
> filehandle of the new directory, so I don't think there's anything in
> the NFS protocol that would *prevent* implementing this.  (Whether
> that's useful, I don't know.)

The same applies to SMB, there's only a single SMB2/3 Create call,
which is able to create/open files or directories and returns
an open file handle for it.

With an atomic mkdir+open it would be possible have a single round trip
between client and server. It would help on the client, but also for Samba
as a server, as we would be able to skip additional syscalls.

And it would be great to have a way to specify flags similar to O_CREAT
and O_EXCL in order to create a new directory or open an existing one.

It should also be possible to pass in RESOLVE_* flags, so a similar call like openat2()
would be great.

For me openat2() with O_CREAT | O_EXCL | O_DIRECTORY, would be the natural thing to
support, because it's natural in the SMB protocol. But Al seems to hate that and I'm fine
with his arguments against that.

Plus we can't use that anyway as it's currently not rejected with EINVAL,
instead a regular file is created on disk, but -1 ENOTDIR
is returned to userspace.

Currently userspace needs to do something like this in order to be safe for
a given untrusted directory path string (userdirpath) being to be opened
(and created if it doesn't exist):

1. make a copy of userdirpath and call dirname() => dirnameresult
2. make a copy of userdirpath and call basename() => basenameresult
3. call dirfd = openat2(basedirfd, dirnameresult, how = {.flags = O_PATH | O_CLOEXEC, .resolve = RESOLVE_BENEATH});
4. call mkdirat(dirfd, basenameresult, 0755)
5. call close(dirfd)
6. ignore possible EEXIST from mkdirat
7. call fd = openat2(basedirfd, userdirpath, how = { .flags = O_DIRECTORY | O_CLOEXEC, .resolve = RESOLVE_BENEATH});

This requires memory allocations and 4 syscall round trips.

It would be wonderful to just have a single syscall for this.
I'm not sure about the exact details of the API or a possible name
for such a syscall (mkdirat2 seems wrong), but it could look like this:

struct somenewdirsyscall_how {
	__u64 flags; / only O_CLOEXEC, O_CREAT, O_EXCL */
        __u64 mode;
        __u64 resolve;
};

Instead of reusing O_* flags, new defines could also be used.

fd = somenewdirsyscall(basedirfd, userdirpath, how = { .flags = O_CLOEXEC | O_CREAT, .mask = 0755, .resolve = RESOLVE_BENEATH});

What would be a good way forward here?

metze
diff mbox series

Patch

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index ee7b01bb7346..c621fa5aaccf 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -480,3 +480,4 @@ 
 548	common	pidfd_getfd			sys_pidfd_getfd
 549	common	faccessat2			sys_faccessat2
 550	common	process_madvise			sys_process_madvise
+551	common	mkdirat2			sys_mkdirat2
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index d056a548358e..8ad43ac853b4 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -454,3 +454,4 @@ 
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	faccessat2			sys_faccessat2
 440	common	process_madvise			sys_process_madvise
+441	common	mkdirat2			sys_mkdirat2
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index b3b2019f8d16..86a9d7b3eabe 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -38,7 +38,7 @@ 
 #define __ARM_NR_compat_set_tls		(__ARM_NR_COMPAT_BASE + 5)
 #define __ARM_NR_COMPAT_END		(__ARM_NR_COMPAT_BASE + 0x800)
 
-#define __NR_compat_syscalls		441
+#define __NR_compat_syscalls		442
 #endif
 
 #define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index 107f08e03b9f..b9ae6fbba839 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -889,6 +889,8 @@  __SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
 __SYSCALL(__NR_faccessat2, sys_faccessat2)
 #define __NR_process_madvise 440
 __SYSCALL(__NR_process_madvise, sys_process_madvise)
+#define __NR_mkdirat2 441
+__SYSCALL(__NR_mkdirat2, sys_mkdirat2)
 
 /*
  * Please add new compat syscalls above this comment and update
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index b96ed8b8a508..e71ee20bf3da 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -361,3 +361,4 @@ 
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	faccessat2			sys_faccessat2
 440	common	process_madvise			sys_process_madvise
+441	common	mkdirat2			sys_mkdirat2
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 625fb6d32842..a64bc3463b48 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -440,3 +440,4 @@ 
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	faccessat2			sys_faccessat2
 440	common	process_madvise			sys_process_madvise
+441	common	mkdirat2			sys_mkdirat2
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index aae729c95cf9..199f4a6df658 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -446,3 +446,4 @@ 
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	faccessat2			sys_faccessat2
 440	common	process_madvise			sys_process_madvise
+441	common	mkdirat2			sys_mkdirat2
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 32817c954435..6fab6f6e9b0f 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -379,3 +379,4 @@ 
 438	n32	pidfd_getfd			sys_pidfd_getfd
 439	n32	faccessat2			sys_faccessat2
 440	n32	process_madvise			sys_process_madvise
+441	n32	mkdirat2			sys_mkdirat2
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index 9e4ea3c31b1c..f9c672d00e75 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -355,3 +355,4 @@ 
 438	n64	pidfd_getfd			sys_pidfd_getfd
 439	n64	faccessat2			sys_faccessat2
 440	n64	process_madvise			sys_process_madvise
+441	n64	mkdirat2			sys_mkdirat2
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index f375ea528e59..a1f433044a7c 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -438,3 +438,4 @@ 
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	faccessat2			sys_faccessat2
 440	common	process_madvise			sys_process_madvise
+441	common	mkdirat2			sys_mkdirat2
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 1275daec7fec..1c73b34517d1 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -530,3 +530,4 @@ 
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	faccessat2			sys_faccessat2
 440	common	process_madvise			sys_process_madvise
+441	common	mkdirat2			sys_mkdirat2
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index 28c168000483..2cc370736912 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -443,3 +443,4 @@ 
 438  common	pidfd_getfd		sys_pidfd_getfd			sys_pidfd_getfd
 439  common	faccessat2		sys_faccessat2			sys_faccessat2
 440  common	process_madvise		sys_process_madvise		sys_process_madvise
+441  common	mkdirat2		sys_mkdirat2			sys_mkdirat2
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 783738448ff5..e0e15828c19f 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -443,3 +443,4 @@ 
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	faccessat2			sys_faccessat2
 440	common	process_madvise			sys_process_madvise
+441	common	mkdirat2			sys_mkdirat2
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index 78160260991b..57bee5e64645 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -486,3 +486,4 @@ 
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	faccessat2			sys_faccessat2
 440	common	process_madvise			sys_process_madvise
+441	common	mkdirat2			sys_mkdirat2
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 379819244b91..8bb125749d7b 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -362,6 +362,7 @@ 
 438	common	pidfd_getfd		sys_pidfd_getfd
 439	common	faccessat2		sys_faccessat2
 440	common	process_madvise		sys_process_madvise
+441	common	mkdirat2		sys_mkdirat2
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index b070f272995d..07f1ddc8092f 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -411,3 +411,4 @@ 
 438	common	pidfd_getfd			sys_pidfd_getfd
 439	common	faccessat2			sys_faccessat2
 440	common	process_madvise			sys_process_madvise
+440	common	mkdirat2			sys_mkdirat2
diff --git a/fs/namei.c b/fs/namei.c
index d4a6dd772303..6bd296d929d7 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3654,13 +3654,27 @@  int vfs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
 }
 EXPORT_SYMBOL(vfs_mkdir);
 
-static long do_mkdirat(int dfd, const char __user *pathname, umode_t mode)
+static long do_mkdirat2(int dfd, const char __user *pathname, umode_t mode,
+			int flags, bool open)
 {
+	struct open_flags op;
 	struct dentry *dentry;
+	struct filename *tmp;
 	struct path path;
 	int error;
+	int fd;
 	unsigned int lookup_flags = LOOKUP_DIRECTORY;
 
+	if (open) {
+		tmp = getname(pathname);
+		if (IS_ERR(tmp))
+			return PTR_ERR(tmp);
+
+		fd = get_unused_fd_flags(flags);
+		if (fd < 0)
+			return fd;
+	}
+
 retry:
 	dentry = user_path_create(dfd, pathname, &path, lookup_flags);
 	if (IS_ERR(dentry))
@@ -3669,24 +3683,46 @@  static long do_mkdirat(int dfd, const char __user *pathname, umode_t mode)
 	if (!IS_POSIXACL(path.dentry->d_inode))
 		mode &= ~current_umask();
 	error = security_path_mkdir(&path, dentry, mode);
-	if (!error)
+	if (!error) {
 		error = vfs_mkdir(path.dentry->d_inode, dentry, mode);
+		if (open) {
+			struct file *f = do_filp_open(dfd, tmp, &op);
+			if (IS_ERR(f)) {
+				put_unused_fd(fd);
+				error = PTR_ERR(f);
+				goto out;
+			} else {
+				fsnotify_open(f);
+				fd_install(fd, f);
+			}
+		}
+	}
 	done_path_create(&path, dentry);
 	if (retry_estale(error, lookup_flags)) {
 		lookup_flags |= LOOKUP_REVAL;
 		goto retry;
 	}
-	return error;
+out:
+	if (open)
+		putname(tmp);
+	if (error < 0 || !open)
+		return error;
+	return fd;
+}
+
+SYSCALL_DEFINE4(mkdirat2, int, dfd, const char __user *, pathname, umode_t, mode, int, flags)
+{
+	return do_mkdirat2(dfd, pathname, mode, flags, true);
 }
 
 SYSCALL_DEFINE3(mkdirat, int, dfd, const char __user *, pathname, umode_t, mode)
 {
-	return do_mkdirat(dfd, pathname, mode);
+	return do_mkdirat2(dfd, pathname, mode, 0, false);
 }
 
 SYSCALL_DEFINE2(mkdir, const char __user *, pathname, umode_t, mode)
 {
-	return do_mkdirat(AT_FDCWD, pathname, mode);
+	return do_mkdirat2(AT_FDCWD, pathname, mode, 0, false);
 }
 
 int vfs_rmdir(struct inode *dir, struct dentry *dentry)
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index aea0ce9f3b74..8dd5d7acc333 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -417,6 +417,7 @@  asmlinkage long sys_flock(unsigned int fd, unsigned int cmd);
 asmlinkage long sys_mknodat(int dfd, const char __user * filename, umode_t mode,
 			    unsigned dev);
 asmlinkage long sys_mkdirat(int dfd, const char __user * pathname, umode_t mode);
+asmlinkage long sys_mkdirat2(int dfd, const char __user * pathname, umode_t mode, int flags);
 asmlinkage long sys_unlinkat(int dfd, const char __user * pathname, int flag);
 asmlinkage long sys_symlinkat(const char __user * oldname,
 			      int newdfd, const char __user * newname);
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index 2056318988f7..5a4604461ede 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -859,9 +859,11 @@  __SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd)
 __SYSCALL(__NR_faccessat2, sys_faccessat2)
 #define __NR_process_madvise 440
 __SYSCALL(__NR_process_madvise, sys_process_madvise)
+#define __NR_process_madvise 441
+__SYSCALL(__NR_mkdirat2, sys_mkdirat2)
 
 #undef __NR_syscalls
-#define __NR_syscalls 441
+#define __NR_syscalls 442
 
 /*
  * 32 bit systems traditionally used different