Message ID | f6d632fc82d4750b73c83a2f1d1b9972cf3e26bb.1650553693.git.qemu_oss@crudebyte.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | 9pfs: macOS host fixes | expand |
On Thu, 21 Apr 2022 17:07:43 +0200 Christian Schoenebeck <qemu_oss@crudebyte.com> wrote: > mknod() on macOS does not support creating sockets, so divert to > call sequence socket(), bind() and chmod() respectively if S_IFSOCK > was passed with mode argument. > > Link: https://lore.kernel.org/qemu-devel/17933734.zYzKuhC07K@silver/ > Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com> > Reviewed-by: Will Cohen <wwcohen@gmail.com> > --- > hw/9pfs/9p-util-darwin.c | 27 ++++++++++++++++++++++++++- > 1 file changed, 26 insertions(+), 1 deletion(-) > > diff --git a/hw/9pfs/9p-util-darwin.c b/hw/9pfs/9p-util-darwin.c > index e24d09763a..39308f2a45 100644 > --- a/hw/9pfs/9p-util-darwin.c > +++ b/hw/9pfs/9p-util-darwin.c > @@ -74,6 +74,27 @@ int fsetxattrat_nofollow(int dirfd, const char *filename, const char *name, > */ > #if defined CONFIG_PTHREAD_FCHDIR_NP > > +static int create_socket_file_at_cwd(const char *filename, mode_t mode) { > + int fd, err; > + struct sockaddr_un addr = { > + .sun_family = AF_UNIX > + }; > + > + fd = socket(PF_UNIX, SOCK_DGRAM, 0); > + if (fd == -1) { > + return fd; > + } > + snprintf(addr.sun_path, sizeof(addr.sun_path), "./%s", filename); > + err = bind(fd, (struct sockaddr *) &addr, sizeof(addr)); > + if (err == -1) { > + goto out; > + } > + err = chmod(addr.sun_path, mode); > +out: > + close(fd); You need close_preserve_errno() here. Rest LGTM, so with that fixed, you can add: Reviewed-by: Greg Kurz <groug@kaod.org> > + return err; > +} > + > int qemu_mknodat(int dirfd, const char *filename, mode_t mode, dev_t dev) > { > int preserved_errno, err; > @@ -93,7 +114,11 @@ int qemu_mknodat(int dirfd, const char *filename, mode_t mode, dev_t dev) > if (pthread_fchdir_np(dirfd) < 0) { > return -1; > } > - err = mknod(filename, mode, dev); > + if (S_ISSOCK(mode)) { > + err = create_socket_file_at_cwd(filename, mode); > + } else { > + err = mknod(filename, mode, dev); > + } > preserved_errno = errno; > /* Stop using the thread-local cwd */ > pthread_fchdir_np(-1);
On Donnerstag, 21. April 2022 18:36:31 CEST Greg Kurz wrote: > On Thu, 21 Apr 2022 17:07:43 +0200 > > Christian Schoenebeck <qemu_oss@crudebyte.com> wrote: > > mknod() on macOS does not support creating sockets, so divert to > > call sequence socket(), bind() and chmod() respectively if S_IFSOCK > > was passed with mode argument. > > > > Link: https://lore.kernel.org/qemu-devel/17933734.zYzKuhC07K@silver/ > > Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com> > > Reviewed-by: Will Cohen <wwcohen@gmail.com> > > --- > > > > hw/9pfs/9p-util-darwin.c | 27 ++++++++++++++++++++++++++- > > 1 file changed, 26 insertions(+), 1 deletion(-) > > > > diff --git a/hw/9pfs/9p-util-darwin.c b/hw/9pfs/9p-util-darwin.c > > index e24d09763a..39308f2a45 100644 > > --- a/hw/9pfs/9p-util-darwin.c > > +++ b/hw/9pfs/9p-util-darwin.c > > @@ -74,6 +74,27 @@ int fsetxattrat_nofollow(int dirfd, const char > > *filename, const char *name,> > > */ > > > > #if defined CONFIG_PTHREAD_FCHDIR_NP > > > > +static int create_socket_file_at_cwd(const char *filename, mode_t mode) { > > + int fd, err; > > + struct sockaddr_un addr = { > > + .sun_family = AF_UNIX > > + }; > > + > > + fd = socket(PF_UNIX, SOCK_DGRAM, 0); > > + if (fd == -1) { > > + return fd; > > + } > > + snprintf(addr.sun_path, sizeof(addr.sun_path), "./%s", filename); > > + err = bind(fd, (struct sockaddr *) &addr, sizeof(addr)); > > + if (err == -1) { > > + goto out; > > + } > > + err = chmod(addr.sun_path, mode); > > +out: > > + close(fd); > > You need close_preserve_errno() here. > > Rest LGTM, so with that fixed, you can add: > > Reviewed-by: Greg Kurz <groug@kaod.org> Right, unlike patch 1, we might come from an error path here when closing. I'll just s/close/close_preserve_errno/ this on my end before queuing, without sending a v3, unless somebody finds something else in this series. Thanks! Best regards, Christian Schoenebeck
On 2022/04/22 0:07, Christian Schoenebeck wrote: > mknod() on macOS does not support creating sockets, so divert to > call sequence socket(), bind() and chmod() respectively if S_IFSOCK > was passed with mode argument. > > Link: https://lore.kernel.org/qemu-devel/17933734.zYzKuhC07K@silver/ > Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com> > Reviewed-by: Will Cohen <wwcohen@gmail.com> > --- > hw/9pfs/9p-util-darwin.c | 27 ++++++++++++++++++++++++++- > 1 file changed, 26 insertions(+), 1 deletion(-) > > diff --git a/hw/9pfs/9p-util-darwin.c b/hw/9pfs/9p-util-darwin.c > index e24d09763a..39308f2a45 100644 > --- a/hw/9pfs/9p-util-darwin.c > +++ b/hw/9pfs/9p-util-darwin.c > @@ -74,6 +74,27 @@ int fsetxattrat_nofollow(int dirfd, const char *filename, const char *name, > */ > #if defined CONFIG_PTHREAD_FCHDIR_NP > > +static int create_socket_file_at_cwd(const char *filename, mode_t mode) { > + int fd, err; > + struct sockaddr_un addr = { > + .sun_family = AF_UNIX > + }; > + > + fd = socket(PF_UNIX, SOCK_DGRAM, 0); > + if (fd == -1) { > + return fd; > + } > + snprintf(addr.sun_path, sizeof(addr.sun_path), "./%s", filename); It would result in an incorrect path if the path does not fit in addr.sun_path. It should report an explicit error instead. > + err = bind(fd, (struct sockaddr *) &addr, sizeof(addr)); > + if (err == -1) { > + goto out; You may close(fd) as soon as bind() returns (before checking the returned value) and eliminate goto. > + } > + err = chmod(addr.sun_path, mode); I'm not sure if it is fine to have a time window between bind() and chmod(). Do you have some rationale? Regards, Akihiko Odaki > +out: > + close(fd); > + return err; > +} > + > int qemu_mknodat(int dirfd, const char *filename, mode_t mode, dev_t dev) > { > int preserved_errno, err; > @@ -93,7 +114,11 @@ int qemu_mknodat(int dirfd, const char *filename, mode_t mode, dev_t dev) > if (pthread_fchdir_np(dirfd) < 0) { > return -1; > } > - err = mknod(filename, mode, dev); > + if (S_ISSOCK(mode)) { > + err = create_socket_file_at_cwd(filename, mode); > + } else { > + err = mknod(filename, mode, dev); > + } > preserved_errno = errno; > /* Stop using the thread-local cwd */ > pthread_fchdir_np(-1);
On Freitag, 22. April 2022 04:43:40 CEST Akihiko Odaki wrote: > On 2022/04/22 0:07, Christian Schoenebeck wrote: > > mknod() on macOS does not support creating sockets, so divert to > > call sequence socket(), bind() and chmod() respectively if S_IFSOCK > > was passed with mode argument. > > > > Link: https://lore.kernel.org/qemu-devel/17933734.zYzKuhC07K@silver/ > > Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com> > > Reviewed-by: Will Cohen <wwcohen@gmail.com> > > --- > > > > hw/9pfs/9p-util-darwin.c | 27 ++++++++++++++++++++++++++- > > 1 file changed, 26 insertions(+), 1 deletion(-) > > > > diff --git a/hw/9pfs/9p-util-darwin.c b/hw/9pfs/9p-util-darwin.c > > index e24d09763a..39308f2a45 100644 > > --- a/hw/9pfs/9p-util-darwin.c > > +++ b/hw/9pfs/9p-util-darwin.c > > @@ -74,6 +74,27 @@ int fsetxattrat_nofollow(int dirfd, const char > > *filename, const char *name,> > > */ > > > > #if defined CONFIG_PTHREAD_FCHDIR_NP > > > > +static int create_socket_file_at_cwd(const char *filename, mode_t mode) { > > + int fd, err; > > + struct sockaddr_un addr = { > > + .sun_family = AF_UNIX > > + }; > > + > > + fd = socket(PF_UNIX, SOCK_DGRAM, 0); > > + if (fd == -1) { > > + return fd; > > + } > > + snprintf(addr.sun_path, sizeof(addr.sun_path), "./%s", filename); > > It would result in an incorrect path if the path does not fit in > addr.sun_path. It should report an explicit error instead. Looking at its header file, 'sun_path' is indeed defined on macOS with an oddly small size of only 104 bytes. So yes, I should explicitly handle that error case. I'll post a v3. > > + err = bind(fd, (struct sockaddr *) &addr, sizeof(addr)); > > + if (err == -1) { > > + goto out; > > You may close(fd) as soon as bind() returns (before checking the > returned value) and eliminate goto. Yeah, I thought about that alternative, but found it a bit ugly, and probably also counter-productive in case this function might get extended with more error pathes in future. Not that I would insist on the current solution though. > > + } > > + err = chmod(addr.sun_path, mode); > > I'm not sure if it is fine to have a time window between bind() and > chmod(). Do you have some rationale? Good question. QEMU's 9p server is multi-threaded; all 9p requests come in serialized and the 9p server controller portion (9p.c) is only running on QEMU main thread, but the actual filesystem driver calls are then dispatched to QEMU worker threads and therefore running concurrently at this point: https://wiki.qemu.org/Documentation/9p#Threads_and_Coroutines Similar situation on Linux 9p client side: it handles access to a mounted 9p filesystem concurrently, requests are then serialized by 9p driver on Linux and sent over wire to 9p server (host). So yes, there might be implications by that short time windows. But could that be exploited on macOS hosts in practice? The socket file would have mode srwxr-xr-x for a short moment. For security_model=mapped* this should not be a problem. For security_model=none|passhrough, in theory, maybe? But how likely is that? If you are using a Linux client for instance, trying to brute-force opening the socket file, the client would send several 9p commands (Twalk, Tgetattr, Topen, probably more). The time window of the two commands above should be much smaller than that and I would expect one of the 9p commands to error out in between. What would be a viable approach to avoid this issue on macOS? > > +out: > > + close(fd); > > + return err; > > +} > > + > > > > int qemu_mknodat(int dirfd, const char *filename, mode_t mode, dev_t > > dev) > > { > > > > int preserved_errno, err; > > > > @@ -93,7 +114,11 @@ int qemu_mknodat(int dirfd, const char *filename, > > mode_t mode, dev_t dev)> > > if (pthread_fchdir_np(dirfd) < 0) { > > > > return -1; > > > > } > > > > - err = mknod(filename, mode, dev); > > + if (S_ISSOCK(mode)) { > > + err = create_socket_file_at_cwd(filename, mode); > > + } else { > > + err = mknod(filename, mode, dev); > > + } > > > > preserved_errno = errno; > > /* Stop using the thread-local cwd */ > > pthread_fchdir_np(-1);
On 2022/04/22 23:06, Christian Schoenebeck wrote: > On Freitag, 22. April 2022 04:43:40 CEST Akihiko Odaki wrote: >> On 2022/04/22 0:07, Christian Schoenebeck wrote: >>> mknod() on macOS does not support creating sockets, so divert to >>> call sequence socket(), bind() and chmod() respectively if S_IFSOCK >>> was passed with mode argument. >>> >>> Link: https://lore.kernel.org/qemu-devel/17933734.zYzKuhC07K@silver/ >>> Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com> >>> Reviewed-by: Will Cohen <wwcohen@gmail.com> >>> --- >>> >>> hw/9pfs/9p-util-darwin.c | 27 ++++++++++++++++++++++++++- >>> 1 file changed, 26 insertions(+), 1 deletion(-) >>> >>> diff --git a/hw/9pfs/9p-util-darwin.c b/hw/9pfs/9p-util-darwin.c >>> index e24d09763a..39308f2a45 100644 >>> --- a/hw/9pfs/9p-util-darwin.c >>> +++ b/hw/9pfs/9p-util-darwin.c >>> @@ -74,6 +74,27 @@ int fsetxattrat_nofollow(int dirfd, const char >>> *filename, const char *name,> >>> */ >>> >>> #if defined CONFIG_PTHREAD_FCHDIR_NP >>> >>> +static int create_socket_file_at_cwd(const char *filename, mode_t mode) { >>> + int fd, err; >>> + struct sockaddr_un addr = { >>> + .sun_family = AF_UNIX >>> + }; >>> + >>> + fd = socket(PF_UNIX, SOCK_DGRAM, 0); >>> + if (fd == -1) { >>> + return fd; >>> + } >>> + snprintf(addr.sun_path, sizeof(addr.sun_path), "./%s", filename); >> >> It would result in an incorrect path if the path does not fit in >> addr.sun_path. It should report an explicit error instead. > > Looking at its header file, 'sun_path' is indeed defined on macOS with an > oddly small size of only 104 bytes. So yes, I should explicitly handle that > error case. > > I'll post a v3. > >>> + err = bind(fd, (struct sockaddr *) &addr, sizeof(addr)); >>> + if (err == -1) { >>> + goto out; >> >> You may close(fd) as soon as bind() returns (before checking the >> returned value) and eliminate goto. > > Yeah, I thought about that alternative, but found it a bit ugly, and probably > also counter-productive in case this function might get extended with more > error pathes in future. Not that I would insist on the current solution > though. I'm happy with the explanation. Thanks. > >>> + } >>> + err = chmod(addr.sun_path, mode); >> >> I'm not sure if it is fine to have a time window between bind() and >> chmod(). Do you have some rationale? > > Good question. QEMU's 9p server is multi-threaded; all 9p requests come in > serialized and the 9p server controller portion (9p.c) is only running on QEMU > main thread, but the actual filesystem driver calls are then dispatched to > QEMU worker threads and therefore running concurrently at this point: > > https://wiki.qemu.org/Documentation/9p#Threads_and_Coroutines > > Similar situation on Linux 9p client side: it handles access to a mounted 9p > filesystem concurrently, requests are then serialized by 9p driver on Linux > and sent over wire to 9p server (host). > > So yes, there might be implications by that short time windows. But could that > be exploited on macOS hosts in practice? > > The socket file would have mode srwxr-xr-x for a short moment. > > For security_model=mapped* this should not be a problem. > > For security_model=none|passhrough, in theory, maybe? But how likely is that? > If you are using a Linux client for instance, trying to brute-force opening > the socket file, the client would send several 9p commands (Twalk, Tgetattr, > Topen, probably more). The time window of the two commands above should be > much smaller than that and I would expect one of the 9p commands to error out > in between. > > What would be a viable approach to avoid this issue on macOS? It is unlikely that a naive brute-force approach will succeed to exploit. The more concerning scenario is that the attacker uses the knowledge of the underlying implementation of macOS to cause resource contention to widen the window. Whether an exploitation is viable depends on how much time you spend digging XNU. However, I'm also not sure if it really *has* a race condition. Looking at v9fs_co_mknod(), it sequentially calls s->ops->mknod() and s->ops->lstat(). It also results in an entity called "path name based fid" in the code, which inherently cannot identify a file when it is renamed or recreated. If there is some rationale it is safe, it may also be applied to the sequence of bind() and chmod(). Can anyone explain the sequence of s->ops->mknod() and s->ops->lstat() or path name based fid in general? Regards, Akihiko Odaki > >>> +out: >>> + close(fd); >>> + return err; >>> +} >>> + >>> >>> int qemu_mknodat(int dirfd, const char *filename, mode_t mode, dev_t >>> dev) >>> { >>> >>> int preserved_errno, err; >>> >>> @@ -93,7 +114,11 @@ int qemu_mknodat(int dirfd, const char *filename, >>> mode_t mode, dev_t dev)> >>> if (pthread_fchdir_np(dirfd) < 0) { >>> >>> return -1; >>> >>> } >>> >>> - err = mknod(filename, mode, dev); >>> + if (S_ISSOCK(mode)) { >>> + err = create_socket_file_at_cwd(filename, mode); >>> + } else { >>> + err = mknod(filename, mode, dev); >>> + } >>> >>> preserved_errno = errno; >>> /* Stop using the thread-local cwd */ >>> pthread_fchdir_np(-1); > >
On Samstag, 23. April 2022 06:33:50 CEST Akihiko Odaki wrote: > On 2022/04/22 23:06, Christian Schoenebeck wrote: > > On Freitag, 22. April 2022 04:43:40 CEST Akihiko Odaki wrote: > >> On 2022/04/22 0:07, Christian Schoenebeck wrote: > >>> mknod() on macOS does not support creating sockets, so divert to > >>> call sequence socket(), bind() and chmod() respectively if S_IFSOCK > >>> was passed with mode argument. > >>> > >>> Link: https://lore.kernel.org/qemu-devel/17933734.zYzKuhC07K@silver/ > >>> Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com> > >>> Reviewed-by: Will Cohen <wwcohen@gmail.com> > >>> --- > >>> > >>> hw/9pfs/9p-util-darwin.c | 27 ++++++++++++++++++++++++++- > >>> 1 file changed, 26 insertions(+), 1 deletion(-) > >>> > >>> diff --git a/hw/9pfs/9p-util-darwin.c b/hw/9pfs/9p-util-darwin.c > >>> index e24d09763a..39308f2a45 100644 > >>> --- a/hw/9pfs/9p-util-darwin.c > >>> +++ b/hw/9pfs/9p-util-darwin.c > >>> @@ -74,6 +74,27 @@ int fsetxattrat_nofollow(int dirfd, const char > >>> *filename, const char *name,> > >>> > >>> */ > >>> > >>> #if defined CONFIG_PTHREAD_FCHDIR_NP > >>> > >>> +static int create_socket_file_at_cwd(const char *filename, mode_t mode) > >>> { > >>> + int fd, err; > >>> + struct sockaddr_un addr = { > >>> + .sun_family = AF_UNIX > >>> + }; > >>> + > >>> + fd = socket(PF_UNIX, SOCK_DGRAM, 0); > >>> + if (fd == -1) { > >>> + return fd; > >>> + } > >>> + snprintf(addr.sun_path, sizeof(addr.sun_path), "./%s", filename); > >> > >> It would result in an incorrect path if the path does not fit in > >> addr.sun_path. It should report an explicit error instead. > > > > Looking at its header file, 'sun_path' is indeed defined on macOS with an > > oddly small size of only 104 bytes. So yes, I should explicitly handle > > that > > error case. > > > > I'll post a v3. > > > >>> + err = bind(fd, (struct sockaddr *) &addr, sizeof(addr)); > >>> + if (err == -1) { > >>> + goto out; > >> > >> You may close(fd) as soon as bind() returns (before checking the > >> returned value) and eliminate goto. > > > > Yeah, I thought about that alternative, but found it a bit ugly, and > > probably also counter-productive in case this function might get extended > > with more error pathes in future. Not that I would insist on the current > > solution though. > > I'm happy with the explanation. Thanks. > > >>> + } > >>> + err = chmod(addr.sun_path, mode); > >> > >> I'm not sure if it is fine to have a time window between bind() and > >> chmod(). Do you have some rationale? > > > > Good question. QEMU's 9p server is multi-threaded; all 9p requests come in > > serialized and the 9p server controller portion (9p.c) is only running on > > QEMU main thread, but the actual filesystem driver calls are then > > dispatched to QEMU worker threads and therefore running concurrently at > > this point: > > > > https://wiki.qemu.org/Documentation/9p#Threads_and_Coroutines > > > > Similar situation on Linux 9p client side: it handles access to a mounted > > 9p filesystem concurrently, requests are then serialized by 9p driver on > > Linux and sent over wire to 9p server (host). > > > > So yes, there might be implications by that short time windows. But could > > that be exploited on macOS hosts in practice? > > > > The socket file would have mode srwxr-xr-x for a short moment. > > > > For security_model=mapped* this should not be a problem. > > > > For security_model=none|passhrough, in theory, maybe? But how likely is > > that? If you are using a Linux client for instance, trying to brute-force > > opening the socket file, the client would send several 9p commands > > (Twalk, Tgetattr, Topen, probably more). The time window of the two > > commands above should be much smaller than that and I would expect one of > > the 9p commands to error out in between. > > > > What would be a viable approach to avoid this issue on macOS? > > It is unlikely that a naive brute-force approach will succeed to > exploit. The more concerning scenario is that the attacker uses the > knowledge of the underlying implementation of macOS to cause resource > contention to widen the window. Whether an exploitation is viable > depends on how much time you spend digging XNU. > > However, I'm also not sure if it really *has* a race condition. Looking > at v9fs_co_mknod(), it sequentially calls s->ops->mknod() and > s->ops->lstat(). It also results in an entity called "path name based > fid" in the code, which inherently cannot identify a file when it is > renamed or recreated. > > If there is some rationale it is safe, it may also be applied to the > sequence of bind() and chmod(). Can anyone explain the sequence of > s->ops->mknod() and s->ops->lstat() or path name based fid in general? You are talking about 9p server's controller level: I don't see something that would prevent a concurrent open() during this bind() ... chmod() time window unfortunately. Argument 'fidp' passed to function v9fs_co_mknod() reflects the directory in which the new device file shall be created. So 'fidp' is not the device file here, nor is 'fidp' modified during this function. Function v9fs_co_mknod() is entered by 9p server on QEMU main thread. At the beginning of the function it first acquires a read lock on a (per 9p export) global coroutine mutex: v9fs_path_read_lock(s); and holds this lock until returning from function v9fs_co_mknod(). But that's just a read lock. Function v9fs_co_open() also just gains a read lock. So they can happen concurrently. Then v9fs_co_run_in_worker({...}) is called to dispatch and execute all the code block (think of it as an Obj-C "block") inside this (macro actually) on a QEMU worker thread. So an arbitrary background thread would then call the fs driver functions: s->ops->mknod() v9fs_name_to_path() s->ops->lstat() and then at the end of the code block the background thread would dispatch back to QEMU main thread. So when we are reaching: v9fs_path_unlock(s); we are already back on QEMU main thread, hence unlocking on main thread now and finally leaving function v9fs_co_mknod(). The important thing to understand is, while that v9fs_co_run_in_worker({...}) code block is executed on a QEMU worker thread, the QEMU main thread (9p server controller portion, i.e. 9p.c) is *not* sleeping, QEMU main thread rather continues to process other (if any) client requests in the meantime. In other words v9fs_co_run_in_worker() neither behaves exactly like Apple's GCD dispatch_async(), nor like dispatch_sync(), as GCD is not coroutine based. So 9p server might pull a pending 'Topen' client request from the input FIFO in the meantime and likewise dispatch that to a worker thread, etc. Hence a concurrent open() might in theory be possible, but I find it quite unlikely to succeed in practice as the open() call on guest is translated by Linux client into a bunch of synchronous 9p requests on the path passed with the open() call on guest, and a round trip for each 9p message is like what, ~0.3ms or something in this order. That's quite huge compared to the time window I would expect between bind() ... open(). Does this answer your questions? > Regards, > Akihiko Odaki > > >>> +out: > >>> + close(fd); > >>> + return err; > >>> +} > >>> + > >>> > >>> int qemu_mknodat(int dirfd, const char *filename, mode_t mode, dev_t > >>> dev) > >>> { > >>> > >>> int preserved_errno, err; > >>> > >>> @@ -93,7 +114,11 @@ int qemu_mknodat(int dirfd, const char *filename, > >>> mode_t mode, dev_t dev)> > >>> > >>> if (pthread_fchdir_np(dirfd) < 0) { > >>> > >>> return -1; > >>> > >>> } > >>> > >>> - err = mknod(filename, mode, dev); > >>> + if (S_ISSOCK(mode)) { > >>> + err = create_socket_file_at_cwd(filename, mode); > >>> + } else { > >>> + err = mknod(filename, mode, dev); > >>> + } > >>> > >>> preserved_errno = errno; > >>> /* Stop using the thread-local cwd */ > >>> pthread_fchdir_np(-1);
On 2022/04/25 3:45, Christian Schoenebeck wrote: >>>>> + } >>>>> + err = chmod(addr.sun_path, mode); >>>> >>>> I'm not sure if it is fine to have a time window between bind() and >>>> chmod(). Do you have some rationale? >>> >>> Good question. QEMU's 9p server is multi-threaded; all 9p requests come in >>> serialized and the 9p server controller portion (9p.c) is only running on >>> QEMU main thread, but the actual filesystem driver calls are then >>> dispatched to QEMU worker threads and therefore running concurrently at >>> this point: >>> >>> https://wiki.qemu.org/Documentation/9p#Threads_and_Coroutines >>> >>> Similar situation on Linux 9p client side: it handles access to a mounted >>> 9p filesystem concurrently, requests are then serialized by 9p driver on >>> Linux and sent over wire to 9p server (host). >>> >>> So yes, there might be implications by that short time windows. But could >>> that be exploited on macOS hosts in practice? >>> >>> The socket file would have mode srwxr-xr-x for a short moment. >>> >>> For security_model=mapped* this should not be a problem. >>> >>> For security_model=none|passhrough, in theory, maybe? But how likely is >>> that? If you are using a Linux client for instance, trying to brute-force >>> opening the socket file, the client would send several 9p commands >>> (Twalk, Tgetattr, Topen, probably more). The time window of the two >>> commands above should be much smaller than that and I would expect one of >>> the 9p commands to error out in between. >>> >>> What would be a viable approach to avoid this issue on macOS? >> >> It is unlikely that a naive brute-force approach will succeed to >> exploit. The more concerning scenario is that the attacker uses the >> knowledge of the underlying implementation of macOS to cause resource >> contention to widen the window. Whether an exploitation is viable >> depends on how much time you spend digging XNU. >> >> However, I'm also not sure if it really *has* a race condition. Looking >> at v9fs_co_mknod(), it sequentially calls s->ops->mknod() and >> s->ops->lstat(). It also results in an entity called "path name based >> fid" in the code, which inherently cannot identify a file when it is >> renamed or recreated. >> >> If there is some rationale it is safe, it may also be applied to the >> sequence of bind() and chmod(). Can anyone explain the sequence of >> s->ops->mknod() and s->ops->lstat() or path name based fid in general? > > You are talking about 9p server's controller level: I don't see something that > would prevent a concurrent open() during this bind() ... chmod() time window > unfortunately. > > Argument 'fidp' passed to function v9fs_co_mknod() reflects the directory in > which the new device file shall be created. So 'fidp' is not the device file > here, nor is 'fidp' modified during this function. > > Function v9fs_co_mknod() is entered by 9p server on QEMU main thread. At the > beginning of the function it first acquires a read lock on a (per 9p export) > global coroutine mutex: > > v9fs_path_read_lock(s); > > and holds this lock until returning from function v9fs_co_mknod(). But that's > just a read lock. Function v9fs_co_open() also just gains a read lock. So they > can happen concurrently. > > Then v9fs_co_run_in_worker({...}) is called to dispatch and execute all the > code block (think of it as an Obj-C "block") inside this (macro actually) on a > QEMU worker thread. So an arbitrary background thread would then call the fs > driver functions: > > s->ops->mknod() > v9fs_name_to_path() > s->ops->lstat() > > and then at the end of the code block the background thread would dispatch > back to QEMU main thread. So when we are reaching: > > v9fs_path_unlock(s); > > we are already back on QEMU main thread, hence unlocking on main thread now > and finally leaving function v9fs_co_mknod(). > > The important thing to understand is, while that > > v9fs_co_run_in_worker({...}) > > code block is executed on a QEMU worker thread, the QEMU main thread (9p > server controller portion, i.e. 9p.c) is *not* sleeping, QEMU main thread > rather continues to process other (if any) client requests in the meantime. In > other words v9fs_co_run_in_worker() neither behaves exactly like Apple's GCD > dispatch_async(), nor like dispatch_sync(), as GCD is not coroutine based. > > So 9p server might pull a pending 'Topen' client request from the input FIFO > in the meantime and likewise dispatch that to a worker thread, etc. Hence a > concurrent open() might in theory be possible, but I find it quite unlikely to > succeed in practice as the open() call on guest is translated by Linux client > into a bunch of synchronous 9p requests on the path passed with the open() > call on guest, and a round trip for each 9p message is like what, ~0.3ms or > something in this order. That's quite huge compared to the time window I would > expect between bind() ... open(). > > Does this answer your questions? The time window may be widened by a malicious actor if the actor knows XNU well so the window length inferred from experiences is not really enough to claim it safe, particularly when considering about security. On the other hand, I'm wondering if there is same kind of a time window between s->ops->mknodat() and s->ops->lstat(). Also, there should be similar time windows among operations with "path name based fid" as they also use path names as identifiers. If there is a rationale that it is considered secure, we may be able to apply the same logic to the time window between bind() and chmod() and claim it secure. I need a review from someone who understands that part of the code, therefore. Regards, Akihiko Odaki
On Tue, 26 Apr 2022 12:57:37 +0900 Akihiko Odaki <akihiko.odaki@gmail.com> wrote: > On 2022/04/25 3:45, Christian Schoenebeck wrote: > >>>>> + } > >>>>> + err = chmod(addr.sun_path, mode); > >>>> > >>>> I'm not sure if it is fine to have a time window between bind() and > >>>> chmod(). Do you have some rationale? > >>> > >>> Good question. QEMU's 9p server is multi-threaded; all 9p requests come in > >>> serialized and the 9p server controller portion (9p.c) is only running on > >>> QEMU main thread, but the actual filesystem driver calls are then > >>> dispatched to QEMU worker threads and therefore running concurrently at > >>> this point: > >>> > >>> https://wiki.qemu.org/Documentation/9p#Threads_and_Coroutines > >>> > >>> Similar situation on Linux 9p client side: it handles access to a mounted > >>> 9p filesystem concurrently, requests are then serialized by 9p driver on > >>> Linux and sent over wire to 9p server (host). > >>> > >>> So yes, there might be implications by that short time windows. But could > >>> that be exploited on macOS hosts in practice? > >>> > >>> The socket file would have mode srwxr-xr-x for a short moment. > >>> > >>> For security_model=mapped* this should not be a problem. > >>> > >>> For security_model=none|passhrough, in theory, maybe? But how likely is > >>> that? If you are using a Linux client for instance, trying to brute-force > >>> opening the socket file, the client would send several 9p commands > >>> (Twalk, Tgetattr, Topen, probably more). The time window of the two > >>> commands above should be much smaller than that and I would expect one of > >>> the 9p commands to error out in between. > >>> > >>> What would be a viable approach to avoid this issue on macOS? > >> > >> It is unlikely that a naive brute-force approach will succeed to > >> exploit. The more concerning scenario is that the attacker uses the > >> knowledge of the underlying implementation of macOS to cause resource > >> contention to widen the window. Whether an exploitation is viable > >> depends on how much time you spend digging XNU. > >> > >> However, I'm also not sure if it really *has* a race condition. Looking > >> at v9fs_co_mknod(), it sequentially calls s->ops->mknod() and > >> s->ops->lstat(). It also results in an entity called "path name based > >> fid" in the code, which inherently cannot identify a file when it is > >> renamed or recreated. > >> > >> If there is some rationale it is safe, it may also be applied to the > >> sequence of bind() and chmod(). Can anyone explain the sequence of > >> s->ops->mknod() and s->ops->lstat() or path name based fid in general? > > > > You are talking about 9p server's controller level: I don't see something that > > would prevent a concurrent open() during this bind() ... chmod() time window > > unfortunately. > > > > Argument 'fidp' passed to function v9fs_co_mknod() reflects the directory in > > which the new device file shall be created. So 'fidp' is not the device file > > here, nor is 'fidp' modified during this function. > > > > Function v9fs_co_mknod() is entered by 9p server on QEMU main thread. At the > > beginning of the function it first acquires a read lock on a (per 9p export) > > global coroutine mutex: > > > > v9fs_path_read_lock(s); > > > > and holds this lock until returning from function v9fs_co_mknod(). But that's > > just a read lock. Function v9fs_co_open() also just gains a read lock. So they > > can happen concurrently. > > > > Then v9fs_co_run_in_worker({...}) is called to dispatch and execute all the > > code block (think of it as an Obj-C "block") inside this (macro actually) on a > > QEMU worker thread. So an arbitrary background thread would then call the fs > > driver functions: > > > > s->ops->mknod() > > v9fs_name_to_path() > > s->ops->lstat() > > > > and then at the end of the code block the background thread would dispatch > > back to QEMU main thread. So when we are reaching: > > > > v9fs_path_unlock(s); > > > > we are already back on QEMU main thread, hence unlocking on main thread now > > and finally leaving function v9fs_co_mknod(). > > > > The important thing to understand is, while that > > > > v9fs_co_run_in_worker({...}) > > > > code block is executed on a QEMU worker thread, the QEMU main thread (9p > > server controller portion, i.e. 9p.c) is *not* sleeping, QEMU main thread > > rather continues to process other (if any) client requests in the meantime. In > > other words v9fs_co_run_in_worker() neither behaves exactly like Apple's GCD > > dispatch_async(), nor like dispatch_sync(), as GCD is not coroutine based. > > > > So 9p server might pull a pending 'Topen' client request from the input FIFO > > in the meantime and likewise dispatch that to a worker thread, etc. Hence a > > concurrent open() might in theory be possible, but I find it quite unlikely to > > succeed in practice as the open() call on guest is translated by Linux client > > into a bunch of synchronous 9p requests on the path passed with the open() > > call on guest, and a round trip for each 9p message is like what, ~0.3ms or > > something in this order. That's quite huge compared to the time window I would > > expect between bind() ... open(). > > > > Does this answer your questions? > > The time window may be widened by a malicious actor if the actor knows > XNU well so the window length inferred from experiences is not really > enough to claim it safe, particularly when considering about security. > > On the other hand, I'm wondering if there is same kind of a time window > between s->ops->mknodat() and s->ops->lstat(). Also, there should be > similar time windows among operations with "path name based fid" as they > also use path names as identifiers. If there is a rationale that it is > considered secure, we may be able to apply the same logic to the time > window between bind() and chmod() and claim it secure. > I need a review from someone who understands that part of the code, > therefore. > I think Christian's explanation is clear enough. We don't guarantee that v9fs_co_foo() calls run atomically. As a consequence, the client might see transient states or be able to interact with an ongoing request. And to answer your question, we have no specific rationale on security with that. I'm not sure what the concerns are but unless you come up with a valid scenario [*] I don't see any reason to prevent this patch to go forward. [*] things like: - client escaping the shared directory - QEMU crashing - QEMU hogging host resources - client-side unprivileged user gaining elevated privleges in the guest Cheers, -- Greg > Regards, > Akihiko Odaki
On 2022/04/26 21:38, Greg Kurz wrote: > On Tue, 26 Apr 2022 12:57:37 +0900 > Akihiko Odaki <akihiko.odaki@gmail.com> wrote: > >> On 2022/04/25 3:45, Christian Schoenebeck wrote: >>>>>>> + } >>>>>>> + err = chmod(addr.sun_path, mode); >>>>>> >>>>>> I'm not sure if it is fine to have a time window between bind() and >>>>>> chmod(). Do you have some rationale? >>>>> >>>>> Good question. QEMU's 9p server is multi-threaded; all 9p requests come in >>>>> serialized and the 9p server controller portion (9p.c) is only running on >>>>> QEMU main thread, but the actual filesystem driver calls are then >>>>> dispatched to QEMU worker threads and therefore running concurrently at >>>>> this point: >>>>> >>>>> https://wiki.qemu.org/Documentation/9p#Threads_and_Coroutines >>>>> >>>>> Similar situation on Linux 9p client side: it handles access to a mounted >>>>> 9p filesystem concurrently, requests are then serialized by 9p driver on >>>>> Linux and sent over wire to 9p server (host). >>>>> >>>>> So yes, there might be implications by that short time windows. But could >>>>> that be exploited on macOS hosts in practice? >>>>> >>>>> The socket file would have mode srwxr-xr-x for a short moment. >>>>> >>>>> For security_model=mapped* this should not be a problem. >>>>> >>>>> For security_model=none|passhrough, in theory, maybe? But how likely is >>>>> that? If you are using a Linux client for instance, trying to brute-force >>>>> opening the socket file, the client would send several 9p commands >>>>> (Twalk, Tgetattr, Topen, probably more). The time window of the two >>>>> commands above should be much smaller than that and I would expect one of >>>>> the 9p commands to error out in between. >>>>> >>>>> What would be a viable approach to avoid this issue on macOS? >>>> >>>> It is unlikely that a naive brute-force approach will succeed to >>>> exploit. The more concerning scenario is that the attacker uses the >>>> knowledge of the underlying implementation of macOS to cause resource >>>> contention to widen the window. Whether an exploitation is viable >>>> depends on how much time you spend digging XNU. >>>> >>>> However, I'm also not sure if it really *has* a race condition. Looking >>>> at v9fs_co_mknod(), it sequentially calls s->ops->mknod() and >>>> s->ops->lstat(). It also results in an entity called "path name based >>>> fid" in the code, which inherently cannot identify a file when it is >>>> renamed or recreated. >>>> >>>> If there is some rationale it is safe, it may also be applied to the >>>> sequence of bind() and chmod(). Can anyone explain the sequence of >>>> s->ops->mknod() and s->ops->lstat() or path name based fid in general? >>> >>> You are talking about 9p server's controller level: I don't see something that >>> would prevent a concurrent open() during this bind() ... chmod() time window >>> unfortunately. >>> >>> Argument 'fidp' passed to function v9fs_co_mknod() reflects the directory in >>> which the new device file shall be created. So 'fidp' is not the device file >>> here, nor is 'fidp' modified during this function. >>> >>> Function v9fs_co_mknod() is entered by 9p server on QEMU main thread. At the >>> beginning of the function it first acquires a read lock on a (per 9p export) >>> global coroutine mutex: >>> >>> v9fs_path_read_lock(s); >>> >>> and holds this lock until returning from function v9fs_co_mknod(). But that's >>> just a read lock. Function v9fs_co_open() also just gains a read lock. So they >>> can happen concurrently. >>> >>> Then v9fs_co_run_in_worker({...}) is called to dispatch and execute all the >>> code block (think of it as an Obj-C "block") inside this (macro actually) on a >>> QEMU worker thread. So an arbitrary background thread would then call the fs >>> driver functions: >>> >>> s->ops->mknod() >>> v9fs_name_to_path() >>> s->ops->lstat() >>> >>> and then at the end of the code block the background thread would dispatch >>> back to QEMU main thread. So when we are reaching: >>> >>> v9fs_path_unlock(s); >>> >>> we are already back on QEMU main thread, hence unlocking on main thread now >>> and finally leaving function v9fs_co_mknod(). >>> >>> The important thing to understand is, while that >>> >>> v9fs_co_run_in_worker({...}) >>> >>> code block is executed on a QEMU worker thread, the QEMU main thread (9p >>> server controller portion, i.e. 9p.c) is *not* sleeping, QEMU main thread >>> rather continues to process other (if any) client requests in the meantime. In >>> other words v9fs_co_run_in_worker() neither behaves exactly like Apple's GCD >>> dispatch_async(), nor like dispatch_sync(), as GCD is not coroutine based. >>> >>> So 9p server might pull a pending 'Topen' client request from the input FIFO >>> in the meantime and likewise dispatch that to a worker thread, etc. Hence a >>> concurrent open() might in theory be possible, but I find it quite unlikely to >>> succeed in practice as the open() call on guest is translated by Linux client >>> into a bunch of synchronous 9p requests on the path passed with the open() >>> call on guest, and a round trip for each 9p message is like what, ~0.3ms or >>> something in this order. That's quite huge compared to the time window I would >>> expect between bind() ... open(). >>> >>> Does this answer your questions? >> >> The time window may be widened by a malicious actor if the actor knows >> XNU well so the window length inferred from experiences is not really >> enough to claim it safe, particularly when considering about security. >> >> On the other hand, I'm wondering if there is same kind of a time window >> between s->ops->mknodat() and s->ops->lstat(). Also, there should be >> similar time windows among operations with "path name based fid" as they >> also use path names as identifiers. If there is a rationale that it is >> considered secure, we may be able to apply the same logic to the time >> window between bind() and chmod() and claim it secure. >> I need a review from someone who understands that part of the code, >> therefore. >> > > I think Christian's explanation is clear enough. We don't guarantee > that v9fs_co_foo() calls run atomically. As a consequence, the client > might see transient states or be able to interact with an ongoing > request. And to answer your question, we have no specific rationale > on security with that. > > I'm not sure what the concerns are but unless you come up with a > valid scenario [*] I don't see any reason to prevent this patch > to go forward. > > [*] things like: > - client escaping the shared directory > - QEMU crashing > - QEMU hogging host resources > - client-side unprivileged user gaining elevated privleges > in the guest I was just not sure if such transient states are safe. The past discussion was about the length of the non-atomic time window where a path name is used to identify a particular file, but if such states are not considered problematic, the length does not matter all and we can confidently say the sequence of bind() and chmod() is safe. Considering the transient states are tolerated in 9pfs, we need to design this function to be tolerant with transient states as well. The use of chmod() is not safe when we consider about transient states. A malicious actor may replace the file at the path with a symlink which may escape the shared directory and chmod() will naively follow it. chmod() should be replaced with fchmodat_nofollow() or something similar. Regards, Akihiko Odaki > > Cheers, > > -- > Greg > >> Regards, >> Akihiko Odaki >
On Wed, 27 Apr 2022 11:27:28 +0900 Akihiko Odaki <akihiko.odaki@gmail.com> wrote: > On 2022/04/26 21:38, Greg Kurz wrote: [..skip..] > > > > I think Christian's explanation is clear enough. We don't guarantee > > that v9fs_co_foo() calls run atomically. As a consequence, the client > > might see transient states or be able to interact with an ongoing > > request. And to answer your question, we have no specific rationale > > on security with that. > > > > I'm not sure what the concerns are but unless you come up with a > > valid scenario [*] I don't see any reason to prevent this patch > > to go forward. > > > > [*] things like: > > - client escaping the shared directory > > - QEMU crashing > > - QEMU hogging host resources > > - client-side unprivileged user gaining elevated privleges > > in the guest > > I was just not sure if such transient states are safe. The past > discussion was about the length of the non-atomic time window where a > path name is used to identify a particular file, but if such states are > not considered problematic, the length does not matter all and we can > confidently say the sequence of bind() and chmod() is safe. > > Considering the transient states are tolerated in 9pfs, we need to > design this function to be tolerant with transient states as well. The > use of chmod() is not safe when we consider about transient states. A > malicious actor may replace the file at the path with a symlink which > may escape the shared directory and chmod() will naively follow it. You get a point here. Thanks for your tenacity ! :-) > chmod() should be replaced with fchmodat_nofollow() or something similar. > On a GNU/Linux system, this could be achieved by calling fchmod() on the socket fd *before* calling bind() but I'm afraid this hack might not work with a BSDish OS. Replacing chmod() with fchmodat_nofollow(dirfd, addr.sun_path, mode) won't make things atomic as above but at least it won't follow a malicious symbolic link : mknod() on the client will fail with ELOOP, which is fine when it comes to not breaking out of the shared directory. This brings up a new problem I hadn't realized before : the fchmodat_nofollow() implementation in 9p-local.c is really a linux only thing to cope with AT_SYMLINK_NOFOLLOW not being supported with fchmodat(). It looks that this should move to 9p-util-linux.c and a proper version should be added for macOS in 9p-util-darwin.c Cheers, -- Greg > Regards, > Akihiko Odaki > > > > > Cheers, > > > > -- > > Greg > > > >> Regards, > >> Akihiko Odaki > > >
On Mittwoch, 27. April 2022 12:18:10 CEST Greg Kurz wrote: > On Wed, 27 Apr 2022 11:27:28 +0900 > > Akihiko Odaki <akihiko.odaki@gmail.com> wrote: > > On 2022/04/26 21:38, Greg Kurz wrote: > [..skip..] > > > > I think Christian's explanation is clear enough. We don't guarantee > > > that v9fs_co_foo() calls run atomically. As a consequence, the client > > > might see transient states or be able to interact with an ongoing > > > request. And to answer your question, we have no specific rationale > > > on security with that. > > > > > > I'm not sure what the concerns are but unless you come up with a > > > valid scenario [*] I don't see any reason to prevent this patch > > > to go forward. > > > > > > [*] things like: > > > - client escaping the shared directory > > > - QEMU crashing > > > - QEMU hogging host resources > > > - client-side unprivileged user gaining elevated privleges > > > > > > in the guest > > > > I was just not sure if such transient states are safe. The past > > discussion was about the length of the non-atomic time window where a > > path name is used to identify a particular file, but if such states are > > not considered problematic, the length does not matter all and we can > > confidently say the sequence of bind() and chmod() is safe. > > > > Considering the transient states are tolerated in 9pfs, we need to > > design this function to be tolerant with transient states as well. The > > use of chmod() is not safe when we consider about transient states. A > > malicious actor may replace the file at the path with a symlink which > > may escape the shared directory and chmod() will naively follow it. > > You get a point here. Thanks for your tenacity ! :-) Yep, I send a v4 with fchmodat_nofollow() instead of chmod(), thanks! BTW, why is it actually allowed for client to create a symlink pointing outside exported directory tree with security_model=passthrough/none? Did anybody want that? > > chmod() should be replaced with fchmodat_nofollow() or something similar. > > On a GNU/Linux system, this could be achieved by calling fchmod() on > the socket fd *before* calling bind() but I'm afraid this hack might > not work with a BSDish OS. As you already imagined, this is unfortunately not supported by any BSDs, including macOS. I'll file a bug report with Apple though. > Replacing chmod() with fchmodat_nofollow(dirfd, addr.sun_path, mode) > won't make things atomic as above but at least it won't follow a > malicious symbolic link : mknod() on the client will fail with > ELOOP, which is fine when it comes to not breaking out of the shared > directory. Current security_model=passthrough/none already has similar non-atomic operations BTW, so this was not something new. E.g.: static int local_symlink(FsContext *fs_ctx, const char *oldpath, V9fsPath *dir_path, const char *name, FsCred *credp) { ... } else if (fs_ctx->export_flags & V9FS_SM_PASSTHROUGH || fs_ctx->export_flags & V9FS_SM_NONE) { err = symlinkat(oldpath, dirfd, name); if (err) { goto out; } err = fchownat(dirfd, name, credp->fc_uid, credp->fc_gid, AT_SYMLINK_NOFOLLOW); ... } In general, if you care about a higher degree of security, I'd always recommend to use security_model=mapped in the first place. > This brings up a new problem I hadn't realized before : the > fchmodat_nofollow() implementation in 9p-local.c is really > a linux only thing to cope with AT_SYMLINK_NOFOLLOW not being > supported with fchmodat(). It looks that this should move to > 9p-util-linux.c and a proper version should be added for macOS > in 9p-util-darwin.c Like already agreed on the other thread, yes, that makes sense. But I think this can be handled with a follow-up, separate from this series. Best regards, Christian Schoenebeck
On Wed, 27 Apr 2022 14:32:53 +0200 Christian Schoenebeck <qemu_oss@crudebyte.com> wrote: > On Mittwoch, 27. April 2022 12:18:10 CEST Greg Kurz wrote: > > On Wed, 27 Apr 2022 11:27:28 +0900 > > > > Akihiko Odaki <akihiko.odaki@gmail.com> wrote: > > > On 2022/04/26 21:38, Greg Kurz wrote: > > [..skip..] > > > > > > I think Christian's explanation is clear enough. We don't guarantee > > > > that v9fs_co_foo() calls run atomically. As a consequence, the client > > > > might see transient states or be able to interact with an ongoing > > > > request. And to answer your question, we have no specific rationale > > > > on security with that. > > > > > > > > I'm not sure what the concerns are but unless you come up with a > > > > valid scenario [*] I don't see any reason to prevent this patch > > > > to go forward. > > > > > > > > [*] things like: > > > > - client escaping the shared directory > > > > - QEMU crashing > > > > - QEMU hogging host resources > > > > - client-side unprivileged user gaining elevated privleges > > > > > > > > in the guest > > > > > > I was just not sure if such transient states are safe. The past > > > discussion was about the length of the non-atomic time window where a > > > path name is used to identify a particular file, but if such states are > > > not considered problematic, the length does not matter all and we can > > > confidently say the sequence of bind() and chmod() is safe. > > > > > > Considering the transient states are tolerated in 9pfs, we need to > > > design this function to be tolerant with transient states as well. The > > > use of chmod() is not safe when we consider about transient states. A > > > malicious actor may replace the file at the path with a symlink which > > > may escape the shared directory and chmod() will naively follow it. > > > > You get a point here. Thanks for your tenacity ! :-) > > Yep, I send a v4 with fchmodat_nofollow() instead of chmod(), thanks! > > BTW, why is it actually allowed for client to create a symlink pointing > outside exported directory tree with security_model=passthrough/none? Did > anybody want that? > The target argument to symlink() is merely a string that is stored in the inode. It is only evaluated as a path at the time an action is made on the link. Checking at symlink() time is thus useless. Anyway, we're safe on this side since it's the client's job to resolve links and we explicitly don't follow them in the server. > > > chmod() should be replaced with fchmodat_nofollow() or something similar. > > > > On a GNU/Linux system, this could be achieved by calling fchmod() on > > the socket fd *before* calling bind() but I'm afraid this hack might > > not work with a BSDish OS. > > As you already imagined, this is unfortunately not supported by any BSDs, > including macOS. I'll file a bug report with Apple though. > I'm not sure if this is documented and supported behavior on linux either. > > Replacing chmod() with fchmodat_nofollow(dirfd, addr.sun_path, mode) > > won't make things atomic as above but at least it won't follow a > > malicious symbolic link : mknod() on the client will fail with > > ELOOP, which is fine when it comes to not breaking out of the shared > > directory. > > Current security_model=passthrough/none already has similar non-atomic > operations BTW, so this was not something new. E.g.: > > static int local_symlink(FsContext *fs_ctx, const char *oldpath, > V9fsPath *dir_path, const char *name, FsCred *credp) > { > ... > } else if (fs_ctx->export_flags & V9FS_SM_PASSTHROUGH || > fs_ctx->export_flags & V9FS_SM_NONE) { > err = symlinkat(oldpath, dirfd, name); > if (err) { > goto out; > } > err = fchownat(dirfd, name, credp->fc_uid, credp->fc_gid, > AT_SYMLINK_NOFOLLOW); > ... > } > Yes similar window but this is secure since fchownat() supports AT_SYMLINK_NOFOLLOW. > In general, if you care about a higher degree of security, I'd always > recommend to use security_model=mapped in the first place. > > > This brings up a new problem I hadn't realized before : the > > fchmodat_nofollow() implementation in 9p-local.c is really > > a linux only thing to cope with AT_SYMLINK_NOFOLLOW not being > > supported with fchmodat(). It looks that this should move to > > 9p-util-linux.c and a proper version should be added for macOS > > in 9p-util-darwin.c > > Like already agreed on the other thread, yes, that makes sense. But I think > this can be handled with a follow-up, separate from this series. > Fair enough if you want to handle fchmodat_nofollow() later but you must at least use fchmodat(AT_SYMLINK_NOFOLLOW) in this patch instead of chmod(). > Best regards, > Christian Schoenebeck > >
On Mittwoch, 27. April 2022 15:31:42 CEST Greg Kurz wrote: > On Wed, 27 Apr 2022 14:32:53 +0200 > > Christian Schoenebeck <qemu_oss@crudebyte.com> wrote: > > On Mittwoch, 27. April 2022 12:18:10 CEST Greg Kurz wrote: > > > On Wed, 27 Apr 2022 11:27:28 +0900 > > > > > > Akihiko Odaki <akihiko.odaki@gmail.com> wrote: > > > > On 2022/04/26 21:38, Greg Kurz wrote: [...] > > > > Considering the transient states are tolerated in 9pfs, we need to > > > > design this function to be tolerant with transient states as well. The > > > > use of chmod() is not safe when we consider about transient states. A > > > > malicious actor may replace the file at the path with a symlink which > > > > may escape the shared directory and chmod() will naively follow it. > > > > > > You get a point here. Thanks for your tenacity ! :-) > > > > Yep, I send a v4 with fchmodat_nofollow() instead of chmod(), thanks! > > > > BTW, why is it actually allowed for client to create a symlink pointing > > outside exported directory tree with security_model=passthrough/none? Did > > anybody want that? > > The target argument to symlink() is merely a string that is stored in > the inode. It is only evaluated as a path at the time an action is > made on the link. Checking at symlink() time is thus useless. > > Anyway, we're safe on this side since it's the client's job to > resolve links and we explicitly don't follow them in the server. I wouldn't call it useless, because it is easier to keep exactly 1 hole stuffed instead of being forced to continuously stuff N holes as new ones may popup up over time, as this case shows (mea culpa). So you are trading (probably minor) performance for security. But my question was actually meant seriously: whether there was anybody in the past who probably actually wanted to create relative symlinks outside the exported tree. For instance for another service with wider host filesystem access. [...] > > > This brings up a new problem I hadn't realized before : the > > > fchmodat_nofollow() implementation in 9p-local.c is really > > > a linux only thing to cope with AT_SYMLINK_NOFOLLOW not being > > > supported with fchmodat(). It looks that this should move to > > > 9p-util-linux.c and a proper version should be added for macOS > > > in 9p-util-darwin.c > > > > Like already agreed on the other thread, yes, that makes sense. But I > > think > > this can be handled with a follow-up, separate from this series. > > Fair enough if you want to handle fchmodat_nofollow() later but you > must at least use fchmodat(AT_SYMLINK_NOFOLLOW) in this patch > instead of chmod(). Sounds like a quick and easy workaround. However looking at 'man fchmodat' on macOS, this probably does not exactly do what you wanted it to, at least not in this particular case: AT_SYMLINK_NOFOLLOW If path names a symbolic link, then the mode of the symbolic link is changed. AT_SYMLINK_NOFOLLOW_ANY If path names a symbolic link, then the mode of the symbolic link is changed and if if the path has any other symbolic links, an error is returned. So if somebody replaced the socket file by a 1st order symlink, you would adjust the symlink's permissions with both AT_SYMLINK_NOFOLLOW as well as with AT_SYMLINK_NOFOLLOW_ANY. I mean it's better than chmod(), but acceptable? Using our existing fchmodat_nofollow() instead is no viable alternative either, as it uses operations that are not supported on socket files on macOS (tested). Best regards, Christian Schoenebeck
On Wed, Apr 27, 2022 at 12:18 PM Christian Schoenebeck < qemu_oss@crudebyte.com> wrote: > On Mittwoch, 27. April 2022 15:31:42 CEST Greg Kurz wrote: > > On Wed, 27 Apr 2022 14:32:53 +0200 > > > > Christian Schoenebeck <qemu_oss@crudebyte.com> wrote: > > > On Mittwoch, 27. April 2022 12:18:10 CEST Greg Kurz wrote: > > > > On Wed, 27 Apr 2022 11:27:28 +0900 > > > > > > > > Akihiko Odaki <akihiko.odaki@gmail.com> wrote: > > > > > On 2022/04/26 21:38, Greg Kurz wrote: > [...] > > > > > Considering the transient states are tolerated in 9pfs, we need to > > > > > design this function to be tolerant with transient states as well. > The > > > > > use of chmod() is not safe when we consider about transient > states. A > > > > > malicious actor may replace the file at the path with a symlink > which > > > > > may escape the shared directory and chmod() will naively follow it. > > > > > > > > You get a point here. Thanks for your tenacity ! :-) > > > > > > Yep, I send a v4 with fchmodat_nofollow() instead of chmod(), thanks! > > > > > > BTW, why is it actually allowed for client to create a symlink pointing > > > outside exported directory tree with security_model=passthrough/none? > Did > > > anybody want that? > > > > The target argument to symlink() is merely a string that is stored in > > the inode. It is only evaluated as a path at the time an action is > > made on the link. Checking at symlink() time is thus useless. > > > > Anyway, we're safe on this side since it's the client's job to > > resolve links and we explicitly don't follow them in the server. > > I wouldn't call it useless, because it is easier to keep exactly 1 hole > stuffed instead of being forced to continuously stuff N holes as new ones > may > popup up over time, as this case shows (mea culpa). > > So you are trading (probably minor) performance for security. > > But my question was actually meant seriously: whether there was anybody in > the > past who probably actually wanted to create relative symlinks outside the > exported tree. For instance for another service with wider host filesystem > access. > For what it's worth, the inability to follow symlinks read-write outside of the tree appears to be, at the moment, the primary pain point for new users of 9pfs in containerization software (see the later comments in https://github.com/lima-vm/lima/pull/726 and to a lesser extent https://github.com/containers/podman/issues/13784). To my knowledge, neither podman nor lima have come up with conclusive preferred solutions for how to address this, much less had a robust discussion with the QEMU team. The lima discussion notes that it works read-only with passthrough/none, so I'd suggest that if there weren't users of it before, there are now! As I understand it, one partial solution for downstream software that allows for read-write may just be to more proactively mount larger directories to minimize the number of external paths that symlinks might get tripped up on. That said, this will stop working when it comes to linking to additional mounts, since /Volumes on darwin will never line up with /mnt. > [...] > > > > This brings up a new problem I hadn't realized before : the > > > > fchmodat_nofollow() implementation in 9p-local.c is really > > > > a linux only thing to cope with AT_SYMLINK_NOFOLLOW not being > > > > supported with fchmodat(). It looks that this should move to > > > > 9p-util-linux.c and a proper version should be added for macOS > > > > in 9p-util-darwin.c > > > > > > Like already agreed on the other thread, yes, that makes sense. But I > > > think > > > this can be handled with a follow-up, separate from this series. > > > > Fair enough if you want to handle fchmodat_nofollow() later but you > > must at least use fchmodat(AT_SYMLINK_NOFOLLOW) in this patch > > instead of chmod(). > > Sounds like a quick and easy workaround. However looking at 'man fchmodat' > on > macOS, this probably does not exactly do what you wanted it to, at least > not > in this particular case: > > AT_SYMLINK_NOFOLLOW > If path names a symbolic link, then the mode of the symbolic > link is changed. > > AT_SYMLINK_NOFOLLOW_ANY > If path names a symbolic link, then the mode of the symbolic > link is changed and > if if the path has any other symbolic links, an error is > returned. > > So if somebody replaced the socket file by a 1st order symlink, you would > adjust the symlink's permissions with both AT_SYMLINK_NOFOLLOW as well as > with > AT_SYMLINK_NOFOLLOW_ANY. I mean it's better than chmod(), but acceptable? > > Using our existing fchmodat_nofollow() instead is no viable alternative > either, as it uses operations that are not supported on socket files on > macOS > (tested). > > Best regards, > Christian Schoenebeck > > >
On Wed, 27 Apr 2022 18:18:31 +0200 Christian Schoenebeck <qemu_oss@crudebyte.com> wrote: > On Mittwoch, 27. April 2022 15:31:42 CEST Greg Kurz wrote: > > On Wed, 27 Apr 2022 14:32:53 +0200 > > > > Christian Schoenebeck <qemu_oss@crudebyte.com> wrote: > > > On Mittwoch, 27. April 2022 12:18:10 CEST Greg Kurz wrote: > > > > On Wed, 27 Apr 2022 11:27:28 +0900 > > > > > > > > Akihiko Odaki <akihiko.odaki@gmail.com> wrote: > > > > > On 2022/04/26 21:38, Greg Kurz wrote: > [...] > > > > > Considering the transient states are tolerated in 9pfs, we need to > > > > > design this function to be tolerant with transient states as well. The > > > > > use of chmod() is not safe when we consider about transient states. A > > > > > malicious actor may replace the file at the path with a symlink which > > > > > may escape the shared directory and chmod() will naively follow it. > > > > > > > > You get a point here. Thanks for your tenacity ! :-) > > > > > > Yep, I send a v4 with fchmodat_nofollow() instead of chmod(), thanks! > > > > > > BTW, why is it actually allowed for client to create a symlink pointing > > > outside exported directory tree with security_model=passthrough/none? Did > > > anybody want that? > > > > The target argument to symlink() is merely a string that is stored in > > the inode. It is only evaluated as a path at the time an action is > > made on the link. Checking at symlink() time is thus useless. > > > > Anyway, we're safe on this side since it's the client's job to > > resolve links and we explicitly don't follow them in the server. > > I wouldn't call it useless, because it is easier to keep exactly 1 hole > stuffed instead of being forced to continuously stuff N holes as new ones may > popup up over time, as this case shows (mea culpa). > > So you are trading (probably minor) performance for security. > > But my question was actually meant seriously: whether there was anybody in the > past who probably actually wanted to create relative symlinks outside the > exported tree. For instance for another service with wider host filesystem > access. > I took the question seriously :-), the problem is that even if you do a check on the target at symlink() time, you don't know how it will be evaluated in the end. Quick demonstration: $ cd /var/tmp/ $ mkdir foo $ cd foo/ $ # Suppose foo is the jail $ mkdir bar $ ln -sf .. bar/link $ realpath bar/link /var/tmp/foo $ # Good, we're still under foo $ mv bar/link . $ realpath link /var/tmp $ # Ouch we've escaped So in the end, the only real fix is to ban path based syscalls and pass AT_SYMLINK_NOFOLLOW everywhere. This was the justification for rewriting nearly all 9p local in order to fix CVE-2016-9602. https://lists.gnu.org/archive/html/qemu-devel/2017-01/msg06225.html > [...] > > > > This brings up a new problem I hadn't realized before : the > > > > fchmodat_nofollow() implementation in 9p-local.c is really > > > > a linux only thing to cope with AT_SYMLINK_NOFOLLOW not being > > > > supported with fchmodat(). It looks that this should move to > > > > 9p-util-linux.c and a proper version should be added for macOS > > > > in 9p-util-darwin.c > > > > > > Like already agreed on the other thread, yes, that makes sense. But I > > > think > > > this can be handled with a follow-up, separate from this series. > > > > Fair enough if you want to handle fchmodat_nofollow() later but you > > must at least use fchmodat(AT_SYMLINK_NOFOLLOW) in this patch > > instead of chmod(). > > Sounds like a quick and easy workaround. However looking at 'man fchmodat' on > macOS, this probably does not exactly do what you wanted it to, at least not > in this particular case: > > AT_SYMLINK_NOFOLLOW > If path names a symbolic link, then the mode of the symbolic link is changed. > > AT_SYMLINK_NOFOLLOW_ANY > If path names a symbolic link, then the mode of the symbolic link is changed and > if if the path has any other symbolic links, an error is returned. > > So if somebody replaced the socket file by a 1st order symlink, you would > adjust the symlink's permissions with both AT_SYMLINK_NOFOLLOW as well as with > AT_SYMLINK_NOFOLLOW_ANY. I mean it's better than chmod(), but acceptable? > As long as the link is not followed outside, we're good : it will change the symlink mode and then what ? > Using our existing fchmodat_nofollow() instead is no viable alternative > either, as it uses operations that are not supported on socket files on macOS > (tested). > > Best regards, > Christian Schoenebeck > >
On Mittwoch, 27. April 2022 19:12:15 CEST Will Cohen wrote: > On Wed, Apr 27, 2022 at 12:18 PM Christian Schoenebeck < > > qemu_oss@crudebyte.com> wrote: > > On Mittwoch, 27. April 2022 15:31:42 CEST Greg Kurz wrote: > > > On Wed, 27 Apr 2022 14:32:53 +0200 > > > > > > Christian Schoenebeck <qemu_oss@crudebyte.com> wrote: > > > > On Mittwoch, 27. April 2022 12:18:10 CEST Greg Kurz wrote: > > > > > On Wed, 27 Apr 2022 11:27:28 +0900 > > > > > > > > > > Akihiko Odaki <akihiko.odaki@gmail.com> wrote: > > > > > > On 2022/04/26 21:38, Greg Kurz wrote: > > [...] > > > > > > > > Considering the transient states are tolerated in 9pfs, we need to > > > > > > design this function to be tolerant with transient states as well. > > > > The > > > > > > > > use of chmod() is not safe when we consider about transient > > > > states. A > > > > > > > > malicious actor may replace the file at the path with a symlink > > > > which > > > > > > > > may escape the shared directory and chmod() will naively follow > > > > > > it. > > > > > > > > > > You get a point here. Thanks for your tenacity ! :-) > > > > > > > > Yep, I send a v4 with fchmodat_nofollow() instead of chmod(), thanks! > > > > > > > > BTW, why is it actually allowed for client to create a symlink > > > > pointing > > > > outside exported directory tree with security_model=passthrough/none? > > > > Did > > > > > > anybody want that? > > > > > > The target argument to symlink() is merely a string that is stored in > > > the inode. It is only evaluated as a path at the time an action is > > > made on the link. Checking at symlink() time is thus useless. > > > > > > Anyway, we're safe on this side since it's the client's job to > > > resolve links and we explicitly don't follow them in the server. > > > > I wouldn't call it useless, because it is easier to keep exactly 1 hole > > stuffed instead of being forced to continuously stuff N holes as new ones > > may > > popup up over time, as this case shows (mea culpa). > > > > So you are trading (probably minor) performance for security. > > > > But my question was actually meant seriously: whether there was anybody in > > the > > past who probably actually wanted to create relative symlinks outside the > > exported tree. For instance for another service with wider host filesystem > > access. > > For what it's worth, the inability to follow symlinks read-write outside of > the tree appears to be, at the moment, the primary pain point for new users > of 9pfs in containerization software (see the later comments in > https://github.com/lima-vm/lima/pull/726 and to a lesser extent > https://github.com/containers/podman/issues/13784). > > To my knowledge, neither podman nor lima have come up with conclusive > preferred solutions for how to address this, much less had a robust > discussion with the QEMU team. > The lima discussion notes that it works read-only with passthrough/none, so > I'd suggest that if there weren't users of it before, there are now! As I > understand it, one partial solution for downstream software that allows for > read-write may just be to more proactively mount larger directories to > minimize the number of external paths that symlinks might get tripped up > on. That said, this will stop working when it comes to linking to > additional mounts, since /Volumes on darwin will never line up with /mnt. That's a different thing. People in those discussions were using security_model=mapped where symlinks on guest are virtually mapped as textual file content (try 'cat <symlink>' on host). So in this mode symlinks on host and symlinks on guest are intentionally separated from each other. The issue I was referring to was about security_model=passthrough|none which has the exact same symlinks accessible both on host and guest side, and more specifically I meant: symlinks created by guest that would point to a location *above* the 9p export root. E.g. say guest has access to the following host directory via 9p, that is access *below* the following directory on host: /vm/foo/ And say guest now mounts that host directory and creates a symlink like: mount -t 9p foo /mnt cd /mnt ln -s ../bar bar That symlink would now point to /bar from guest's PoV, and to /vm/bar from host's PoV (i.e. a location on host where guest should not have access to at all). BTW some of the other issues mentioned in the linked discussion, like the timeout errors, are fixed with this patch set. Best regards, Christian Schoenebeck
On Mittwoch, 27. April 2022 19:37:39 CEST Greg Kurz wrote: > On Wed, 27 Apr 2022 18:18:31 +0200 > > Christian Schoenebeck <qemu_oss@crudebyte.com> wrote: > > On Mittwoch, 27. April 2022 15:31:42 CEST Greg Kurz wrote: > > > On Wed, 27 Apr 2022 14:32:53 +0200 > > > > > > Christian Schoenebeck <qemu_oss@crudebyte.com> wrote: > > > > On Mittwoch, 27. April 2022 12:18:10 CEST Greg Kurz wrote: > > > > > On Wed, 27 Apr 2022 11:27:28 +0900 > > > > > > > > > > Akihiko Odaki <akihiko.odaki@gmail.com> wrote: > > > > > > On 2022/04/26 21:38, Greg Kurz wrote: > > [...] > > > > > > > > Considering the transient states are tolerated in 9pfs, we need to > > > > > > design this function to be tolerant with transient states as well. > > > > > > The > > > > > > use of chmod() is not safe when we consider about transient > > > > > > states. A > > > > > > malicious actor may replace the file at the path with a symlink > > > > > > which > > > > > > may escape the shared directory and chmod() will naively follow > > > > > > it. > > > > > > > > > > You get a point here. Thanks for your tenacity ! :-) > > > > > > > > Yep, I send a v4 with fchmodat_nofollow() instead of chmod(), thanks! > > > > > > > > BTW, why is it actually allowed for client to create a symlink > > > > pointing > > > > outside exported directory tree with security_model=passthrough/none? > > > > Did > > > > anybody want that? > > > > > > The target argument to symlink() is merely a string that is stored in > > > the inode. It is only evaluated as a path at the time an action is > > > made on the link. Checking at symlink() time is thus useless. > > > > > > Anyway, we're safe on this side since it's the client's job to > > > resolve links and we explicitly don't follow them in the server. > > > > I wouldn't call it useless, because it is easier to keep exactly 1 hole > > stuffed instead of being forced to continuously stuff N holes as new ones > > may popup up over time, as this case shows (mea culpa). > > > > So you are trading (probably minor) performance for security. > > > > But my question was actually meant seriously: whether there was anybody in > > the past who probably actually wanted to create relative symlinks outside > > the exported tree. For instance for another service with wider host > > filesystem access. > > I took the question seriously :-), the problem is that even if you > do a check on the target at symlink() time, you don't know how it > will be evaluated in the end. > > Quick demonstration: > > $ cd /var/tmp/ > $ mkdir foo > $ cd foo/ > $ # Suppose foo is the jail > $ mkdir bar > $ ln -sf .. bar/link > $ realpath bar/link > /var/tmp/foo > $ # Good, we're still under foo > $ mv bar/link . > $ realpath link > /var/tmp > $ # Ouch we've escaped > > So in the end, the only real fix is to ban path based syscalls and > pass AT_SYMLINK_NOFOLLOW everywhere. This was the justification for > rewriting nearly all 9p local in order to fix CVE-2016-9602. > > https://lists.gnu.org/archive/html/qemu-devel/2017-01/msg06225.html Touché :) Agreed, it's not worth it. I mean this simple example could still be addressed by catching the move, but if you have like several nested directories, each with a huge number of chained symlinks, on top of it non-atomic issues etc., then things would get way expensive to check, as you would actually have to traverse an entire tree and validate an even bigger amount of symlink pathes for every single symlink modification attempt on guest, probably even with exclusive locks, and so on. > > [...] > > > > > > > This brings up a new problem I hadn't realized before : the > > > > > fchmodat_nofollow() implementation in 9p-local.c is really > > > > > a linux only thing to cope with AT_SYMLINK_NOFOLLOW not being > > > > > supported with fchmodat(). It looks that this should move to > > > > > 9p-util-linux.c and a proper version should be added for macOS > > > > > in 9p-util-darwin.c > > > > > > > > Like already agreed on the other thread, yes, that makes sense. But I > > > > think > > > > this can be handled with a follow-up, separate from this series. > > > > > > Fair enough if you want to handle fchmodat_nofollow() later but you > > > must at least use fchmodat(AT_SYMLINK_NOFOLLOW) in this patch > > > instead of chmod(). > > > > Sounds like a quick and easy workaround. However looking at 'man fchmodat' > > on macOS, this probably does not exactly do what you wanted it to, at > > least not> > > in this particular case: > > AT_SYMLINK_NOFOLLOW > > > > If path names a symbolic link, then the mode of the symbolic > > link is changed.> > > AT_SYMLINK_NOFOLLOW_ANY > > > > If path names a symbolic link, then the mode of the symbolic > > link is changed and if if the path has any other symbolic > > links, an error is returned.> > > So if somebody replaced the socket file by a 1st order symlink, you would > > adjust the symlink's permissions with both AT_SYMLINK_NOFOLLOW as well as > > with AT_SYMLINK_NOFOLLOW_ANY. I mean it's better than chmod(), but > > acceptable? > As long as the link is not followed outside, we're good : it will change the > symlink mode and then what ? OK, then fine with me. I already filed a bug report with Apple for supporting fchmod(socket()). Hopefully we'll have a clean atomic solution in near future for this issue then. I'll post v4 now. Thanks! Best regards, Christian Schoenebeck
diff --git a/hw/9pfs/9p-util-darwin.c b/hw/9pfs/9p-util-darwin.c index e24d09763a..39308f2a45 100644 --- a/hw/9pfs/9p-util-darwin.c +++ b/hw/9pfs/9p-util-darwin.c @@ -74,6 +74,27 @@ int fsetxattrat_nofollow(int dirfd, const char *filename, const char *name, */ #if defined CONFIG_PTHREAD_FCHDIR_NP +static int create_socket_file_at_cwd(const char *filename, mode_t mode) { + int fd, err; + struct sockaddr_un addr = { + .sun_family = AF_UNIX + }; + + fd = socket(PF_UNIX, SOCK_DGRAM, 0); + if (fd == -1) { + return fd; + } + snprintf(addr.sun_path, sizeof(addr.sun_path), "./%s", filename); + err = bind(fd, (struct sockaddr *) &addr, sizeof(addr)); + if (err == -1) { + goto out; + } + err = chmod(addr.sun_path, mode); +out: + close(fd); + return err; +} + int qemu_mknodat(int dirfd, const char *filename, mode_t mode, dev_t dev) { int preserved_errno, err; @@ -93,7 +114,11 @@ int qemu_mknodat(int dirfd, const char *filename, mode_t mode, dev_t dev) if (pthread_fchdir_np(dirfd) < 0) { return -1; } - err = mknod(filename, mode, dev); + if (S_ISSOCK(mode)) { + err = create_socket_file_at_cwd(filename, mode); + } else { + err = mknod(filename, mode, dev); + } preserved_errno = errno; /* Stop using the thread-local cwd */ pthread_fchdir_np(-1);