Message ID | 20210228002500.11483-1-sir@cmpwn.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | [RFC] fs: introduce mkdirat2 syscall for atomic mkdir | expand |
On Sat, Feb 27, 2021 at 07:25:00PM -0500, Drew DeVault wrote: > This introduces mkdirat2, along with the requisite flag argument, which > presently accepts the same flags as open - allowing the caller to > specify, say, O_CLOEXEC - and leaving us room to expand the next time an > unforeseeable addition to mkdir is called for. Otherwise, it behaves > identically to mkdirat, but returns an open file descriptor for the new > directory. No to the ABI part; "on error it returns -E..., on success - 0 or a non-negative number representing a file descriptor (zero also possible, but unlikely)" is bloody awful as calling conventions go, especially since the case when 0 happens to be a descriptor is not going to get a lot of testing on the userland side. Don't mix "return an error or descriptor" with "return an error or 0". It's going to end up a regular source of userland bugs.
On Sat Feb 27, 2021 at 9:13 PM EST, Al Viro wrote: > No to the ABI part; "on error it returns -E..., on success - 0 or a > non-negative number representing a file descriptor (zero also > possible, but unlikely)" is bloody awful as calling conventions go, > especially since the case when 0 happens to be a descriptor is not > going to get a lot of testing on the userland side. Hm, I was just trying to mimic the behavior of open(2). Do you have a better suggestion?
On Sat, Feb 27, 2021 at 07:25:00PM -0500, Drew DeVault wrote: > The mkdir and mkdirat syscalls both return 0 on success, and use of the > newly-created directory requires a separate open or openat (or openat2) > call. The time between these syscalls is an opportunity for a race > condition. It is thus desirable to establish a means of creating a > directory and returning an open dirfd for it in one atomic operation. I don't understand what the TOCTOU race is. $ cd /tmp $ mkdir foo $ sudo su fake $ rmdir foo rmdir: failed to remove 'foo': Operation not permitted $ mv foo bar mv: cannot move 'foo' to 'bar': Operation not permitted Where's the problem? If mkdir succeeds in a sticky directory, others can't remove or rename it. So how can an app be tricked into doing something wrong?
On Sat Feb 27, 2021 at 9:24 PM EST, Matthew Wilcox wrote: > Where's the problem? If mkdir succeeds in a sticky directory, others > can't remove or rename it. So how can an app be tricked into doing > something wrong? It's not a security concern, it's just about about making the software more robust. 1. Program A creates a directory 2. Program A is pre-empted 3. Program B deletes the directory 4. Program A creates a file in that directory 5. RIP
On Sat, Feb 27, 2021 at 09:21:09PM -0500, Drew DeVault wrote: > On Sat Feb 27, 2021 at 9:13 PM EST, Al Viro wrote: > > No to the ABI part; "on error it returns -E..., on success - 0 or a > > non-negative number representing a file descriptor (zero also > > possible, but unlikely)" is bloody awful as calling conventions go, > > especially since the case when 0 happens to be a descriptor is not > > going to get a lot of testing on the userland side. > > Hm, I was just trying to mimic the behavior of open(2). Do you have a > better suggestion? open() *always* returns descriptor or an error, for one thing. And quite a few of open() flags are completely wrong for mkdir, starting with symlink following and truncation. What's more, your implementation is both racy and deadlock-prone - it repeats the entire pathwalk with no warranty that it'll arrive to the object you've created *AND* if you have something like /foo/bar/baz/../../splat and dentry of bar gets evicted on memory pressure, that pathwalk will end up trying to look bar up. In the already locked /foo, aka /foo/bar/baz/../.. TBH, I don't understand what are you trying to achieve - what will that mkdir+open combination buy you, especially since that atomicity goes straight out of window if you try to use that on e.g. NFS. How is the userland supposed to make use of that thing?
On Sat, Feb 27, 2021 at 09:26:21PM -0500, Drew DeVault wrote: > On Sat Feb 27, 2021 at 9:24 PM EST, Matthew Wilcox wrote: > > Where's the problem? If mkdir succeeds in a sticky directory, others > > can't remove or rename it. So how can an app be tricked into doing > > something wrong? > > It's not a security concern, it's just about about making the software > more robust. > > 1. Program A creates a directory > 2. Program A is pre-empted > 3. Program B deletes the directory > 4. Program A creates a file in that directory > 5. RIP umm ... program B deletes the directory. program A opens it in order to use openat(). program A gets ENOENT and exits, confused. that's the race you're removing here -- and it seems fairly insignificant to me.
On Sat Feb 27, 2021 at 9:58 PM EST, Al Viro wrote: > open() *always* returns descriptor or an error, for one thing. > And quite a few of open() flags are completely wrong for mkdir, > starting with symlink following and truncation. So does mkdirat2. Are you referring to the do_mkdirat2 function? I merged mkdir/mkdirat/mkdirat2 into one function with a flag to enable the mkdirat2 behavior, to avoid copying and pasting much of the functionality. However, the syscalls themselves don't overload their return value as you expect. mkdir & mkdirat both still return 0 or an error, and mkdirat2 always returns an fd or an error. If you prefer, I can leave their implementations separate so that this is more clear. I supposed the flags might be wrong - should I just introduce a new set of flags, with the specific ones which are useful (which I think is just O_CLOEXEC)? > What's more, your implementation is both racy and deadlock-prone - > it repeats the entire pathwalk with no warranty that it'll > arrive to the object you've created *AND* if you have > something like /foo/bar/baz/../../splat and dentry of bar > gets evicted on memory pressure, that pathwalk will end up > trying to look bar up. In the already locked /foo, aka > /foo/bar/baz/../.. This is down to unfamiliarity with this code, I think. I'll try to give it a closer look. > TBH, I don't understand what are you trying to achieve - > what will that mkdir+open combination buy you, especially > since that atomicity goes straight out of window if you try > to use that on e.g. NFS. How is the userland supposed to make > use of that thing? I'm trying to close what appears to be an oversight in the API. See the previous threads: https://lore.kernel.org/linux-fsdevel/C9KKYZ4T5O53.338Y48UIQ9W3H@taiga/T/#t https://lore.kernel.org/linux-fsdevel/20200316142057.xo24zea3k5zwswra@yavin/ Userland uses it the same way they use mkdir+open, but in one call, so that they can use the directory they make as soon as it's created. The atomicity goal, if possible, would also add a reference to the new directory via the open fd, so they can use it even if it's removed by another process. It makes such applications less error-prone, albiet in a minor edge case. I'm not sure what's involved with the NFS case, but I can look into it.
On Sat Feb 27, 2021 at 11:03 PM EST, Matthew Wilcox wrote: > > 1. Program A creates a directory > > 2. Program A is pre-empted > > 3. Program B deletes the directory > > 4. Program A creates a file in that directory > > 5. RIP > > umm ... program B deletes the directory. program A opens it in order to > use openat(). program A gets ENOENT and exits, confused. that's the > race you're removing here -- and it seems fairly insignificant to me. Yes, that is the race being eliminated here. Instead of this, program A has an fd which holds a reference to the directory, so it just works. A race is a race. It's an oversight in the API.
On Sun, Feb 28, 2021 at 02:58:22AM +0000, Al Viro wrote: > TBH, I don't understand what are you trying to achieve - > what will that mkdir+open combination buy you, especially > since that atomicity goes straight out of window if you try > to use that on e.g. NFS. How is the userland supposed to make > use of that thing? For what it's worth, the RPC that creates a directory can also get a filehandle of the new directory, so I don't think there's anything in the NFS protocol that would *prevent* implementing this. (Whether that's useful, I don't know.) --b.
On Sun, Feb 28, 2021 at 08:57:20AM -0500, Drew DeVault wrote: > On Sat Feb 27, 2021 at 11:03 PM EST, Matthew Wilcox wrote: > > > 1. Program A creates a directory > > > 2. Program A is pre-empted > > > 3. Program B deletes the directory > > > 4. Program A creates a file in that directory > > > 5. RIP > > > > umm ... program B deletes the directory. program A opens it in order to > > use openat(). program A gets ENOENT and exits, confused. that's the > > race you're removing here -- and it seems fairly insignificant to me. > > Yes, that is the race being eliminated here. Instead of this, program A > has an fd which holds a reference to the directory, so it just works. A > race is a race. It's an oversight in the API. Step 4 still fails either way, because you can't create a file in an unlinked directory, even if you hold a reference to that directory. What's the behavior change at step 4 that you're hoping for? --b.
On Mon, Mar 01, 2021 at 02:09:03PM -0500, J. Bruce Fields wrote: > On Sun, Feb 28, 2021 at 08:57:20AM -0500, Drew DeVault wrote: > > On Sat Feb 27, 2021 at 11:03 PM EST, Matthew Wilcox wrote: > > > > 1. Program A creates a directory > > > > 2. Program A is pre-empted > > > > 3. Program B deletes the directory > > > > 4. Program A creates a file in that directory > > > > 5. RIP > > > > > > umm ... program B deletes the directory. program A opens it in order to > > > use openat(). program A gets ENOENT and exits, confused. that's the > > > race you're removing here -- and it seems fairly insignificant to me. > > > > Yes, that is the race being eliminated here. Instead of this, program A > > has an fd which holds a reference to the directory, so it just works. A > > race is a race. It's an oversight in the API. > > Step 4 still fails either way, because you can't create a file in an > unlinked directory, even if you hold a reference to that directory. > What's the behavior change at step 4 that you're hoping for? If step 3 is 'mv foo bar', then the behaviour change will be that the files still get created, just as bar/quux, instead of foo/quux. It's not clear to me this is necessarily an improvement in behaviour. (as an aside, i think there's a missing feature in posix -- being able to atomically replace one directory with another. you can atomically replace one file with another with hard links, but since you can't hardlink a directory, you can't do the same trick. Maybe you should just always move files out of a directory instead of moving directories as a single operation)
On Mon, Mar 01, 2021 at 07:35:37PM +0000, Matthew Wilcox wrote: > On Mon, Mar 01, 2021 at 02:09:03PM -0500, J. Bruce Fields wrote: > > On Sun, Feb 28, 2021 at 08:57:20AM -0500, Drew DeVault wrote: > > > On Sat Feb 27, 2021 at 11:03 PM EST, Matthew Wilcox wrote: > > > > > 1. Program A creates a directory > > > > > 2. Program A is pre-empted > > > > > 3. Program B deletes the directory > > > > > 4. Program A creates a file in that directory > > > > > 5. RIP > > > > > > > > umm ... program B deletes the directory. program A opens it in order to > > > > use openat(). program A gets ENOENT and exits, confused. that's the > > > > race you're removing here -- and it seems fairly insignificant to me. > > > > > > Yes, that is the race being eliminated here. Instead of this, program A > > > has an fd which holds a reference to the directory, so it just works. A > > > race is a race. It's an oversight in the API. > > > > Step 4 still fails either way, because you can't create a file in an > > unlinked directory, even if you hold a reference to that directory. > > What's the behavior change at step 4 that you're hoping for? > > If step 3 is 'mv foo bar', then the behaviour change will be that the > files still get created, just as bar/quux, instead of foo/quux. It's not > clear to me this is necessarily an improvement in behaviour. Oh, OK. Yeah, it'd be useful to have some more detail on how this would be used. --b. > (as an aside, i think there's a missing feature in posix -- being able > to atomically replace one directory with another. you can atomically > replace one file with another with hard links, but since you can't > hardlink a directory, you can't do the same trick. Maybe you should > just always move files out of a directory instead of moving directories > as a single operation)
On Sun, Feb 28, 2021 at 4:02 PM Drew DeVault <sir@cmpwn.com> wrote: > > On Sat Feb 27, 2021 at 11:03 PM EST, Matthew Wilcox wrote: > > > 1. Program A creates a directory > > > 2. Program A is pre-empted > > > 3. Program B deletes the directory > > > 4. Program A creates a file in that directory > > > 5. RIP > > > > umm ... program B deletes the directory. program A opens it in order to > > use openat(). program A gets ENOENT and exits, confused. that's the > > race you're removing here -- and it seems fairly insignificant to me. > > Yes, that is the race being eliminated here. Instead of this, program A > has an fd which holds a reference to the directory, so it just works. A > race is a race. It's an oversight in the API. I think you mixed in confusion with "program B deletes the directory". That will result, as Matthew wrote in ENOENT because that dir is now IS_DEADDIR(). I think I understand what you mean with the oversight in the API, but the use case should involve mkdtemp(3) - make it more like tmpfile(3). Not that *I* can think of the races this can solve, but I am pretty sure that people with security background will be able to rationalize this. You start your pitch by ruling out the option of openat2() with O_CREAT | O_DIRECTORY, because you have strong emotions against it (loathe). I personally do not share this feeling with you, because: 1. The syscall is already used to open directories as well as files 2. The whole idea of openat2() is that you can add new behaviors with new open_how flags, so no existing app will be surprised from behavior change of O_CREAT | O_DIRECTORY combination. For your consideration. Thanks, Amir.
On Mon, Mar 1, 2021 at 8:42 PM Matthew Wilcox <willy@infradead.org> wrote: > (as an aside, i think there's a missing feature in posix -- being able > to atomically replace one directory with another. RENAME_EXCHANGE. Thanks, Miklos
On 2021-03-02, Amir Goldstein <amir73il@gmail.com> wrote: > On Sun, Feb 28, 2021 at 4:02 PM Drew DeVault <sir@cmpwn.com> wrote: > > > > On Sat Feb 27, 2021 at 11:03 PM EST, Matthew Wilcox wrote: > > > > 1. Program A creates a directory > > > > 2. Program A is pre-empted > > > > 3. Program B deletes the directory > > > > 4. Program A creates a file in that directory > > > > 5. RIP > > > > > > umm ... program B deletes the directory. program A opens it in order to > > > use openat(). program A gets ENOENT and exits, confused. that's the > > > race you're removing here -- and it seems fairly insignificant to me. > > > > Yes, that is the race being eliminated here. Instead of this, program A > > has an fd which holds a reference to the directory, so it just works. A > > race is a race. It's an oversight in the API. > > I think you mixed in confusion with "program B deletes the directory". > That will result, as Matthew wrote in ENOENT because that dir is now > IS_DEADDIR(). > > I think I understand what you mean with the oversight in the API, but > the use case should involve mkdtemp(3) - make it more like tmpfile(3). > Not that *I* can think of the races this can solve, but I am pretty sure > that people with security background will be able to rationalize this. > > You start your pitch by ruling out the option of openat2() with > O_CREAT | O_DIRECTORY, because you have strong emotions > against it (loathe). > I personally do not share this feeling with you, because: > 1. The syscall is already used to open directories as well as files Al NACKed doing it as part of open[1]. My understanding is that the main two reasons for that were: 1. open() and mkdir() have different semantics for resolving paths. For instance, open(O_CREAT) will create a file at the target of dangling symlink but mkdir() will not allow that. I believe there's also some funky trailing-"/" handling with mkdirat() as well. 2. Adding more multiplexers is bad. openat2(2) I think (1) alone is a strong enough justification, since I don't think there's precedent for having two different path lookup semantics in the same VFS syscall. > 2. The whole idea of openat2() is that you can add new behaviors > with new open_how flags, so no existing app will be surprised from > behavior change of O_CREAT | O_DIRECTORY combination. While it is true that you *could* do this with openat2(), the intention of openat2() was to allow us to add new arguments openat() if those arguments make sense within the context of an "open" operation. (An example would be the opath_mask stuff which I included in the original series, and am working on re-sending -- that is an additional argument for O_PATH, and is still clearly linked to opening something.) [1]: https://lore.kernel.org/linux-fsdevel/20200313182844.GO23230@ZenIV.linux.org.uk/
Am 01.03.21 um 20:02 schrieb J. Bruce Fields: > On Sun, Feb 28, 2021 at 02:58:22AM +0000, Al Viro wrote: >> TBH, I don't understand what are you trying to achieve - >> what will that mkdir+open combination buy you, especially >> since that atomicity goes straight out of window if you try >> to use that on e.g. NFS. How is the userland supposed to make >> use of that thing? > > For what it's worth, the RPC that creates a directory can also get a > filehandle of the new directory, so I don't think there's anything in > the NFS protocol that would *prevent* implementing this. (Whether > that's useful, I don't know.) The same applies to SMB, there's only a single SMB2/3 Create call, which is able to create/open files or directories and returns an open file handle for it. With an atomic mkdir+open it would be possible have a single round trip between client and server. It would help on the client, but also for Samba as a server, as we would be able to skip additional syscalls. And it would be great to have a way to specify flags similar to O_CREAT and O_EXCL in order to create a new directory or open an existing one. It should also be possible to pass in RESOLVE_* flags, so a similar call like openat2() would be great. For me openat2() with O_CREAT | O_EXCL | O_DIRECTORY, would be the natural thing to support, because it's natural in the SMB protocol. But Al seems to hate that and I'm fine with his arguments against that. Plus we can't use that anyway as it's currently not rejected with EINVAL, instead a regular file is created on disk, but -1 ENOTDIR is returned to userspace. Currently userspace needs to do something like this in order to be safe for a given untrusted directory path string (userdirpath) being to be opened (and created if it doesn't exist): 1. make a copy of userdirpath and call dirname() => dirnameresult 2. make a copy of userdirpath and call basename() => basenameresult 3. call dirfd = openat2(basedirfd, dirnameresult, how = {.flags = O_PATH | O_CLOEXEC, .resolve = RESOLVE_BENEATH}); 4. call mkdirat(dirfd, basenameresult, 0755) 5. call close(dirfd) 6. ignore possible EEXIST from mkdirat 7. call fd = openat2(basedirfd, userdirpath, how = { .flags = O_DIRECTORY | O_CLOEXEC, .resolve = RESOLVE_BENEATH}); This requires memory allocations and 4 syscall round trips. It would be wonderful to just have a single syscall for this. I'm not sure about the exact details of the API or a possible name for such a syscall (mkdirat2 seems wrong), but it could look like this: struct somenewdirsyscall_how { __u64 flags; / only O_CLOEXEC, O_CREAT, O_EXCL */ __u64 mode; __u64 resolve; }; Instead of reusing O_* flags, new defines could also be used. fd = somenewdirsyscall(basedirfd, userdirpath, how = { .flags = O_CLOEXEC | O_CREAT, .mask = 0755, .resolve = RESOLVE_BENEATH}); What would be a good way forward here? metze
diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl index ee7b01bb7346..c621fa5aaccf 100644 --- a/arch/alpha/kernel/syscalls/syscall.tbl +++ b/arch/alpha/kernel/syscalls/syscall.tbl @@ -480,3 +480,4 @@ 548 common pidfd_getfd sys_pidfd_getfd 549 common faccessat2 sys_faccessat2 550 common process_madvise sys_process_madvise +551 common mkdirat2 sys_mkdirat2 diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl index d056a548358e..8ad43ac853b4 100644 --- a/arch/arm/tools/syscall.tbl +++ b/arch/arm/tools/syscall.tbl @@ -454,3 +454,4 @@ 438 common pidfd_getfd sys_pidfd_getfd 439 common faccessat2 sys_faccessat2 440 common process_madvise sys_process_madvise +441 common mkdirat2 sys_mkdirat2 diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h index b3b2019f8d16..86a9d7b3eabe 100644 --- a/arch/arm64/include/asm/unistd.h +++ b/arch/arm64/include/asm/unistd.h @@ -38,7 +38,7 @@ #define __ARM_NR_compat_set_tls (__ARM_NR_COMPAT_BASE + 5) #define __ARM_NR_COMPAT_END (__ARM_NR_COMPAT_BASE + 0x800) -#define __NR_compat_syscalls 441 +#define __NR_compat_syscalls 442 #endif #define __ARCH_WANT_SYS_CLONE diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h index 107f08e03b9f..b9ae6fbba839 100644 --- a/arch/arm64/include/asm/unistd32.h +++ b/arch/arm64/include/asm/unistd32.h @@ -889,6 +889,8 @@ __SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd) __SYSCALL(__NR_faccessat2, sys_faccessat2) #define __NR_process_madvise 440 __SYSCALL(__NR_process_madvise, sys_process_madvise) +#define __NR_mkdirat2 441 +__SYSCALL(__NR_mkdirat2, sys_mkdirat2) /* * Please add new compat syscalls above this comment and update diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl index b96ed8b8a508..e71ee20bf3da 100644 --- a/arch/ia64/kernel/syscalls/syscall.tbl +++ b/arch/ia64/kernel/syscalls/syscall.tbl @@ -361,3 +361,4 @@ 438 common pidfd_getfd sys_pidfd_getfd 439 common faccessat2 sys_faccessat2 440 common process_madvise sys_process_madvise +441 common mkdirat2 sys_mkdirat2 diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl index 625fb6d32842..a64bc3463b48 100644 --- a/arch/m68k/kernel/syscalls/syscall.tbl +++ b/arch/m68k/kernel/syscalls/syscall.tbl @@ -440,3 +440,4 @@ 438 common pidfd_getfd sys_pidfd_getfd 439 common faccessat2 sys_faccessat2 440 common process_madvise sys_process_madvise +441 common mkdirat2 sys_mkdirat2 diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl index aae729c95cf9..199f4a6df658 100644 --- a/arch/microblaze/kernel/syscalls/syscall.tbl +++ b/arch/microblaze/kernel/syscalls/syscall.tbl @@ -446,3 +446,4 @@ 438 common pidfd_getfd sys_pidfd_getfd 439 common faccessat2 sys_faccessat2 440 common process_madvise sys_process_madvise +441 common mkdirat2 sys_mkdirat2 diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl index 32817c954435..6fab6f6e9b0f 100644 --- a/arch/mips/kernel/syscalls/syscall_n32.tbl +++ b/arch/mips/kernel/syscalls/syscall_n32.tbl @@ -379,3 +379,4 @@ 438 n32 pidfd_getfd sys_pidfd_getfd 439 n32 faccessat2 sys_faccessat2 440 n32 process_madvise sys_process_madvise +441 n32 mkdirat2 sys_mkdirat2 diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl index 9e4ea3c31b1c..f9c672d00e75 100644 --- a/arch/mips/kernel/syscalls/syscall_n64.tbl +++ b/arch/mips/kernel/syscalls/syscall_n64.tbl @@ -355,3 +355,4 @@ 438 n64 pidfd_getfd sys_pidfd_getfd 439 n64 faccessat2 sys_faccessat2 440 n64 process_madvise sys_process_madvise +441 n64 mkdirat2 sys_mkdirat2 diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl index f375ea528e59..a1f433044a7c 100644 --- a/arch/parisc/kernel/syscalls/syscall.tbl +++ b/arch/parisc/kernel/syscalls/syscall.tbl @@ -438,3 +438,4 @@ 438 common pidfd_getfd sys_pidfd_getfd 439 common faccessat2 sys_faccessat2 440 common process_madvise sys_process_madvise +441 common mkdirat2 sys_mkdirat2 diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl index 1275daec7fec..1c73b34517d1 100644 --- a/arch/powerpc/kernel/syscalls/syscall.tbl +++ b/arch/powerpc/kernel/syscalls/syscall.tbl @@ -530,3 +530,4 @@ 438 common pidfd_getfd sys_pidfd_getfd 439 common faccessat2 sys_faccessat2 440 common process_madvise sys_process_madvise +441 common mkdirat2 sys_mkdirat2 diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl index 28c168000483..2cc370736912 100644 --- a/arch/s390/kernel/syscalls/syscall.tbl +++ b/arch/s390/kernel/syscalls/syscall.tbl @@ -443,3 +443,4 @@ 438 common pidfd_getfd sys_pidfd_getfd sys_pidfd_getfd 439 common faccessat2 sys_faccessat2 sys_faccessat2 440 common process_madvise sys_process_madvise sys_process_madvise +441 common mkdirat2 sys_mkdirat2 sys_mkdirat2 diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl index 783738448ff5..e0e15828c19f 100644 --- a/arch/sh/kernel/syscalls/syscall.tbl +++ b/arch/sh/kernel/syscalls/syscall.tbl @@ -443,3 +443,4 @@ 438 common pidfd_getfd sys_pidfd_getfd 439 common faccessat2 sys_faccessat2 440 common process_madvise sys_process_madvise +441 common mkdirat2 sys_mkdirat2 diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl index 78160260991b..57bee5e64645 100644 --- a/arch/sparc/kernel/syscalls/syscall.tbl +++ b/arch/sparc/kernel/syscalls/syscall.tbl @@ -486,3 +486,4 @@ 438 common pidfd_getfd sys_pidfd_getfd 439 common faccessat2 sys_faccessat2 440 common process_madvise sys_process_madvise +441 common mkdirat2 sys_mkdirat2 diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl index 379819244b91..8bb125749d7b 100644 --- a/arch/x86/entry/syscalls/syscall_64.tbl +++ b/arch/x86/entry/syscalls/syscall_64.tbl @@ -362,6 +362,7 @@ 438 common pidfd_getfd sys_pidfd_getfd 439 common faccessat2 sys_faccessat2 440 common process_madvise sys_process_madvise +441 common mkdirat2 sys_mkdirat2 # # Due to a historical design error, certain syscalls are numbered differently diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl index b070f272995d..07f1ddc8092f 100644 --- a/arch/xtensa/kernel/syscalls/syscall.tbl +++ b/arch/xtensa/kernel/syscalls/syscall.tbl @@ -411,3 +411,4 @@ 438 common pidfd_getfd sys_pidfd_getfd 439 common faccessat2 sys_faccessat2 440 common process_madvise sys_process_madvise +440 common mkdirat2 sys_mkdirat2 diff --git a/fs/namei.c b/fs/namei.c index d4a6dd772303..6bd296d929d7 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -3654,13 +3654,27 @@ int vfs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode) } EXPORT_SYMBOL(vfs_mkdir); -static long do_mkdirat(int dfd, const char __user *pathname, umode_t mode) +static long do_mkdirat2(int dfd, const char __user *pathname, umode_t mode, + int flags, bool open) { + struct open_flags op; struct dentry *dentry; + struct filename *tmp; struct path path; int error; + int fd; unsigned int lookup_flags = LOOKUP_DIRECTORY; + if (open) { + tmp = getname(pathname); + if (IS_ERR(tmp)) + return PTR_ERR(tmp); + + fd = get_unused_fd_flags(flags); + if (fd < 0) + return fd; + } + retry: dentry = user_path_create(dfd, pathname, &path, lookup_flags); if (IS_ERR(dentry)) @@ -3669,24 +3683,46 @@ static long do_mkdirat(int dfd, const char __user *pathname, umode_t mode) if (!IS_POSIXACL(path.dentry->d_inode)) mode &= ~current_umask(); error = security_path_mkdir(&path, dentry, mode); - if (!error) + if (!error) { error = vfs_mkdir(path.dentry->d_inode, dentry, mode); + if (open) { + struct file *f = do_filp_open(dfd, tmp, &op); + if (IS_ERR(f)) { + put_unused_fd(fd); + error = PTR_ERR(f); + goto out; + } else { + fsnotify_open(f); + fd_install(fd, f); + } + } + } done_path_create(&path, dentry); if (retry_estale(error, lookup_flags)) { lookup_flags |= LOOKUP_REVAL; goto retry; } - return error; +out: + if (open) + putname(tmp); + if (error < 0 || !open) + return error; + return fd; +} + +SYSCALL_DEFINE4(mkdirat2, int, dfd, const char __user *, pathname, umode_t, mode, int, flags) +{ + return do_mkdirat2(dfd, pathname, mode, flags, true); } SYSCALL_DEFINE3(mkdirat, int, dfd, const char __user *, pathname, umode_t, mode) { - return do_mkdirat(dfd, pathname, mode); + return do_mkdirat2(dfd, pathname, mode, 0, false); } SYSCALL_DEFINE2(mkdir, const char __user *, pathname, umode_t, mode) { - return do_mkdirat(AT_FDCWD, pathname, mode); + return do_mkdirat2(AT_FDCWD, pathname, mode, 0, false); } int vfs_rmdir(struct inode *dir, struct dentry *dentry) diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h index aea0ce9f3b74..8dd5d7acc333 100644 --- a/include/linux/syscalls.h +++ b/include/linux/syscalls.h @@ -417,6 +417,7 @@ asmlinkage long sys_flock(unsigned int fd, unsigned int cmd); asmlinkage long sys_mknodat(int dfd, const char __user * filename, umode_t mode, unsigned dev); asmlinkage long sys_mkdirat(int dfd, const char __user * pathname, umode_t mode); +asmlinkage long sys_mkdirat2(int dfd, const char __user * pathname, umode_t mode, int flags); asmlinkage long sys_unlinkat(int dfd, const char __user * pathname, int flag); asmlinkage long sys_symlinkat(const char __user * oldname, int newdfd, const char __user * newname); diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h index 2056318988f7..5a4604461ede 100644 --- a/include/uapi/asm-generic/unistd.h +++ b/include/uapi/asm-generic/unistd.h @@ -859,9 +859,11 @@ __SYSCALL(__NR_pidfd_getfd, sys_pidfd_getfd) __SYSCALL(__NR_faccessat2, sys_faccessat2) #define __NR_process_madvise 440 __SYSCALL(__NR_process_madvise, sys_process_madvise) +#define __NR_process_madvise 441 +__SYSCALL(__NR_mkdirat2, sys_mkdirat2) #undef __NR_syscalls -#define __NR_syscalls 441 +#define __NR_syscalls 442 /* * 32 bit systems traditionally used different