[v2,2/5] 9pfs: fix qemu_mknodat(S_IFSOCK) on macOS

Message ID	f6d632fc82d4750b73c83a2f1d1b9972cf3e26bb.1650553693.git.qemu_oss@crudebyte.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org> Message-Id: <f6d632fc82d4750b73c83a2f1d1b9972cf3e26bb.1650553693.git.qemu_oss@crudebyte.com> In-Reply-To: <cover.1650553693.git.qemu_oss@crudebyte.com> References: <cover.1650553693.git.qemu_oss@crudebyte.com> From: Christian Schoenebeck <qemu_oss@crudebyte.com> Date: Thu, 21 Apr 2022 17:07:43 +0200 Subject: [PATCH v2 2/5] 9pfs: fix qemu_mknodat(S_IFSOCK) on macOS To: qemu-devel@nongnu.org Cc: Will Cohen <wwcohen@gmail.com>, Greg Kurz <groug@kaod.org>, Michael Roitzsch <reactorcontrol@icloud.com>, Keno Fischer <keno@juliacomputing.com>, Akihiko Odaki <akihiko.odaki@gmail.com>, qemu-stable@nongnu.org Received-SPF: none client-ip=91.194.90.13; envelope-from=f6d632fc82d4750b73c83a2f1d1b9972cf3e26bb@lizzy.crudebyte.com; helo=lizzy.crudebyte.com X-Spam_score_int: -20 X-Spam_score: -2.1 X-Spam_bar: -- X-Spam_report: (-2.1 / 5.0 requ) BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, SPF_HELO_NONE=0.001, SPF_NONE=0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no X-Spam_action: no action Precedence: list Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" <qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org>
Series	9pfs: macOS host fixes \| expand [v2,0/5] 9pfs: macOS host fixes [v2,1/5] 9pfs: fix qemu_mknodat(S_IFREG) on macOS [v2,2/5] 9pfs: fix qemu_mknodat(S_IFSOCK) on macOS [v2,3/5] 9pfs: fix wrong encoding of rdev field in Rgetattr on macOS [v2,4/5] 9pfs: fix wrong errno being sent to Linux client on macOS host [v2,5/5] 9pfs: fix removing non-existent POSIX ACL xattr on macOS host

Christian Schoenebeck April 21, 2022, 3:07 p.m. UTC

mknod() on macOS does not support creating sockets, so divert to
call sequence socket(), bind() and chmod() respectively if S_IFSOCK
was passed with mode argument.

Link: https://lore.kernel.org/qemu-devel/17933734.zYzKuhC07K@silver/
Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
Reviewed-by: Will Cohen <wwcohen@gmail.com>
---
 hw/9pfs/9p-util-darwin.c | 27 ++++++++++++++++++++++++++-
 1 file changed, 26 insertions(+), 1 deletion(-)

Greg Kurz April 21, 2022, 4:36 p.m. UTC | #1

On Thu, 21 Apr 2022 17:07:43 +0200
Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:

> mknod() on macOS does not support creating sockets, so divert to
> call sequence socket(), bind() and chmod() respectively if S_IFSOCK
> was passed with mode argument.
> 
> Link: https://lore.kernel.org/qemu-devel/17933734.zYzKuhC07K@silver/
> Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
> Reviewed-by: Will Cohen <wwcohen@gmail.com>
> ---
>  hw/9pfs/9p-util-darwin.c | 27 ++++++++++++++++++++++++++-
>  1 file changed, 26 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/9pfs/9p-util-darwin.c b/hw/9pfs/9p-util-darwin.c
> index e24d09763a..39308f2a45 100644
> --- a/hw/9pfs/9p-util-darwin.c
> +++ b/hw/9pfs/9p-util-darwin.c
> @@ -74,6 +74,27 @@ int fsetxattrat_nofollow(int dirfd, const char *filename, const char *name,
>   */
>  #if defined CONFIG_PTHREAD_FCHDIR_NP
>  
> +static int create_socket_file_at_cwd(const char *filename, mode_t mode) {
> +    int fd, err;
> +    struct sockaddr_un addr = {
> +        .sun_family = AF_UNIX
> +    };
> +
> +    fd = socket(PF_UNIX, SOCK_DGRAM, 0);
> +    if (fd == -1) {
> +        return fd;
> +    }
> +    snprintf(addr.sun_path, sizeof(addr.sun_path), "./%s", filename);
> +    err = bind(fd, (struct sockaddr *) &addr, sizeof(addr));
> +    if (err == -1) {
> +        goto out;
> +    }
> +    err = chmod(addr.sun_path, mode);
> +out:
> +    close(fd);

You need close_preserve_errno() here.

Rest LGTM, so with that fixed, you can add:

Reviewed-by: Greg Kurz <groug@kaod.org>

> +    return err;
> +}
> +
>  int qemu_mknodat(int dirfd, const char *filename, mode_t mode, dev_t dev)
>  {
>      int preserved_errno, err;
> @@ -93,7 +114,11 @@ int qemu_mknodat(int dirfd, const char *filename, mode_t mode, dev_t dev)
>      if (pthread_fchdir_np(dirfd) < 0) {
>          return -1;
>      }
> -    err = mknod(filename, mode, dev);
> +    if (S_ISSOCK(mode)) {
> +        err = create_socket_file_at_cwd(filename, mode);
> +    } else {
> +        err = mknod(filename, mode, dev);
> +    }
>      preserved_errno = errno;
>      /* Stop using the thread-local cwd */
>      pthread_fchdir_np(-1);

Christian Schoenebeck April 21, 2022, 5:29 p.m. UTC | #2

On Donnerstag, 21. April 2022 18:36:31 CEST Greg Kurz wrote:
> On Thu, 21 Apr 2022 17:07:43 +0200
> 
> Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:
> > mknod() on macOS does not support creating sockets, so divert to
> > call sequence socket(), bind() and chmod() respectively if S_IFSOCK
> > was passed with mode argument.
> > 
> > Link: https://lore.kernel.org/qemu-devel/17933734.zYzKuhC07K@silver/
> > Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
> > Reviewed-by: Will Cohen <wwcohen@gmail.com>
> > ---
> > 
> >  hw/9pfs/9p-util-darwin.c | 27 ++++++++++++++++++++++++++-
> >  1 file changed, 26 insertions(+), 1 deletion(-)
> > 
> > diff --git a/hw/9pfs/9p-util-darwin.c b/hw/9pfs/9p-util-darwin.c
> > index e24d09763a..39308f2a45 100644
> > --- a/hw/9pfs/9p-util-darwin.c
> > +++ b/hw/9pfs/9p-util-darwin.c
> > @@ -74,6 +74,27 @@ int fsetxattrat_nofollow(int dirfd, const char
> > *filename, const char *name,> 
> >   */
> >  
> >  #if defined CONFIG_PTHREAD_FCHDIR_NP
> > 
> > +static int create_socket_file_at_cwd(const char *filename, mode_t mode) {
> > +    int fd, err;
> > +    struct sockaddr_un addr = {
> > +        .sun_family = AF_UNIX
> > +    };
> > +
> > +    fd = socket(PF_UNIX, SOCK_DGRAM, 0);
> > +    if (fd == -1) {
> > +        return fd;
> > +    }
> > +    snprintf(addr.sun_path, sizeof(addr.sun_path), "./%s", filename);
> > +    err = bind(fd, (struct sockaddr *) &addr, sizeof(addr));
> > +    if (err == -1) {
> > +        goto out;
> > +    }
> > +    err = chmod(addr.sun_path, mode);
> > +out:
> > +    close(fd);
> 
> You need close_preserve_errno() here.
> 
> Rest LGTM, so with that fixed, you can add:
> 
> Reviewed-by: Greg Kurz <groug@kaod.org>

Right, unlike patch 1, we might come from an error path here when closing.

I'll just s/close/close_preserve_errno/ this on my end before queuing, without 
sending a v3, unless somebody finds something else in this series.

Thanks!

Best regards,
Christian Schoenebeck

Akihiko Odaki April 22, 2022, 2:43 a.m. UTC | #3

On 2022/04/22 0:07, Christian Schoenebeck wrote:
> mknod() on macOS does not support creating sockets, so divert to
> call sequence socket(), bind() and chmod() respectively if S_IFSOCK
> was passed with mode argument.
> 
> Link: https://lore.kernel.org/qemu-devel/17933734.zYzKuhC07K@silver/
> Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
> Reviewed-by: Will Cohen <wwcohen@gmail.com>
> ---
>   hw/9pfs/9p-util-darwin.c | 27 ++++++++++++++++++++++++++-
>   1 file changed, 26 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/9pfs/9p-util-darwin.c b/hw/9pfs/9p-util-darwin.c
> index e24d09763a..39308f2a45 100644
> --- a/hw/9pfs/9p-util-darwin.c
> +++ b/hw/9pfs/9p-util-darwin.c
> @@ -74,6 +74,27 @@ int fsetxattrat_nofollow(int dirfd, const char *filename, const char *name,
>    */
>   #if defined CONFIG_PTHREAD_FCHDIR_NP
>   
> +static int create_socket_file_at_cwd(const char *filename, mode_t mode) {
> +    int fd, err;
> +    struct sockaddr_un addr = {
> +        .sun_family = AF_UNIX
> +    };
> +
> +    fd = socket(PF_UNIX, SOCK_DGRAM, 0);
> +    if (fd == -1) {
> +        return fd;
> +    }
> +    snprintf(addr.sun_path, sizeof(addr.sun_path), "./%s", filename);

It would result in an incorrect path if the path does not fit in 
addr.sun_path. It should report an explicit error instead.

> +    err = bind(fd, (struct sockaddr *) &addr, sizeof(addr));
> +    if (err == -1) {
> +        goto out;

You may close(fd) as soon as bind() returns (before checking the 
returned value) and eliminate goto.

> +    }
> +    err = chmod(addr.sun_path, mode);

I'm not sure if it is fine to have a time window between bind() and 
chmod(). Do you have some rationale?

Regards,
Akihiko Odaki

> +out:
> +    close(fd);
> +    return err;
> +}
> +
>   int qemu_mknodat(int dirfd, const char *filename, mode_t mode, dev_t dev)
>   {
>       int preserved_errno, err;
> @@ -93,7 +114,11 @@ int qemu_mknodat(int dirfd, const char *filename, mode_t mode, dev_t dev)
>       if (pthread_fchdir_np(dirfd) < 0) {
>           return -1;
>       }
> -    err = mknod(filename, mode, dev);
> +    if (S_ISSOCK(mode)) {
> +        err = create_socket_file_at_cwd(filename, mode);
> +    } else {
> +        err = mknod(filename, mode, dev);
> +    }
>       preserved_errno = errno;
>       /* Stop using the thread-local cwd */
>       pthread_fchdir_np(-1);

Christian Schoenebeck April 22, 2022, 2:06 p.m. UTC | #4

On Freitag, 22. April 2022 04:43:40 CEST Akihiko Odaki wrote:
> On 2022/04/22 0:07, Christian Schoenebeck wrote:
> > mknod() on macOS does not support creating sockets, so divert to
> > call sequence socket(), bind() and chmod() respectively if S_IFSOCK
> > was passed with mode argument.
> > 
> > Link: https://lore.kernel.org/qemu-devel/17933734.zYzKuhC07K@silver/
> > Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
> > Reviewed-by: Will Cohen <wwcohen@gmail.com>
> > ---
> > 
> >   hw/9pfs/9p-util-darwin.c | 27 ++++++++++++++++++++++++++-
> >   1 file changed, 26 insertions(+), 1 deletion(-)
> > 
> > diff --git a/hw/9pfs/9p-util-darwin.c b/hw/9pfs/9p-util-darwin.c
> > index e24d09763a..39308f2a45 100644
> > --- a/hw/9pfs/9p-util-darwin.c
> > +++ b/hw/9pfs/9p-util-darwin.c
> > @@ -74,6 +74,27 @@ int fsetxattrat_nofollow(int dirfd, const char
> > *filename, const char *name,> 
> >    */
> >   
> >   #if defined CONFIG_PTHREAD_FCHDIR_NP
> > 
> > +static int create_socket_file_at_cwd(const char *filename, mode_t mode) {
> > +    int fd, err;
> > +    struct sockaddr_un addr = {
> > +        .sun_family = AF_UNIX
> > +    };
> > +
> > +    fd = socket(PF_UNIX, SOCK_DGRAM, 0);
> > +    if (fd == -1) {
> > +        return fd;
> > +    }
> > +    snprintf(addr.sun_path, sizeof(addr.sun_path), "./%s", filename);
> 
> It would result in an incorrect path if the path does not fit in
> addr.sun_path. It should report an explicit error instead.

Looking at its header file, 'sun_path' is indeed defined on macOS with an 
oddly small size of only 104 bytes. So yes, I should explicitly handle that 
error case.

I'll post a v3.

> > +    err = bind(fd, (struct sockaddr *) &addr, sizeof(addr));
> > +    if (err == -1) {
> > +        goto out;
> 
> You may close(fd) as soon as bind() returns (before checking the
> returned value) and eliminate goto.

Yeah, I thought about that alternative, but found it a bit ugly, and probably 
also counter-productive in case this function might get extended with more 
error pathes in future. Not that I would insist on the current solution 
though.

> > +    }
> > +    err = chmod(addr.sun_path, mode);
> 
> I'm not sure if it is fine to have a time window between bind() and
> chmod(). Do you have some rationale?

Good question. QEMU's 9p server is multi-threaded; all 9p requests come in 
serialized and the 9p server controller portion (9p.c) is only running on QEMU 
main thread, but the actual filesystem driver calls are then dispatched to 
QEMU worker threads and therefore running concurrently at this point:

https://wiki.qemu.org/Documentation/9p#Threads_and_Coroutines

Similar situation on Linux 9p client side: it handles access to a mounted 9p 
filesystem concurrently, requests are then serialized by 9p driver on Linux 
and sent over wire to 9p server (host).

So yes, there might be implications by that short time windows. But could that 
be exploited on macOS hosts in practice?

The socket file would have mode srwxr-xr-x for a short moment.

For security_model=mapped* this should not be a problem.

For security_model=none|passhrough, in theory, maybe? But how likely is that? 
If you are using a Linux client for instance, trying to brute-force opening 
the socket file, the client would send several 9p commands (Twalk, Tgetattr, 
Topen, probably more). The time window of the two commands above should be 
much smaller than that and I would expect one of the 9p commands to error out 
in between.

What would be a viable approach to avoid this issue on macOS?

> > +out:
> > +    close(fd);
> > +    return err;
> > +}
> > +
> > 
> >   int qemu_mknodat(int dirfd, const char *filename, mode_t mode, dev_t
> >   dev)
> >   {
> >   
> >       int preserved_errno, err;
> > 
> > @@ -93,7 +114,11 @@ int qemu_mknodat(int dirfd, const char *filename,
> > mode_t mode, dev_t dev)> 
> >       if (pthread_fchdir_np(dirfd) < 0) {
> >       
> >           return -1;
> >       
> >       }
> > 
> > -    err = mknod(filename, mode, dev);
> > +    if (S_ISSOCK(mode)) {
> > +        err = create_socket_file_at_cwd(filename, mode);
> > +    } else {
> > +        err = mknod(filename, mode, dev);
> > +    }
> > 
> >       preserved_errno = errno;
> >       /* Stop using the thread-local cwd */
> >       pthread_fchdir_np(-1);

Akihiko Odaki April 23, 2022, 4:33 a.m. UTC | #5

On 2022/04/22 23:06, Christian Schoenebeck wrote:
> On Freitag, 22. April 2022 04:43:40 CEST Akihiko Odaki wrote:
>> On 2022/04/22 0:07, Christian Schoenebeck wrote:
>>> mknod() on macOS does not support creating sockets, so divert to
>>> call sequence socket(), bind() and chmod() respectively if S_IFSOCK
>>> was passed with mode argument.
>>>
>>> Link: https://lore.kernel.org/qemu-devel/17933734.zYzKuhC07K@silver/
>>> Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
>>> Reviewed-by: Will Cohen <wwcohen@gmail.com>
>>> ---
>>>
>>>    hw/9pfs/9p-util-darwin.c | 27 ++++++++++++++++++++++++++-
>>>    1 file changed, 26 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/hw/9pfs/9p-util-darwin.c b/hw/9pfs/9p-util-darwin.c
>>> index e24d09763a..39308f2a45 100644
>>> --- a/hw/9pfs/9p-util-darwin.c
>>> +++ b/hw/9pfs/9p-util-darwin.c
>>> @@ -74,6 +74,27 @@ int fsetxattrat_nofollow(int dirfd, const char
>>> *filename, const char *name,>
>>>     */
>>>    
>>>    #if defined CONFIG_PTHREAD_FCHDIR_NP
>>>
>>> +static int create_socket_file_at_cwd(const char *filename, mode_t mode) {
>>> +    int fd, err;
>>> +    struct sockaddr_un addr = {
>>> +        .sun_family = AF_UNIX
>>> +    };
>>> +
>>> +    fd = socket(PF_UNIX, SOCK_DGRAM, 0);
>>> +    if (fd == -1) {
>>> +        return fd;
>>> +    }
>>> +    snprintf(addr.sun_path, sizeof(addr.sun_path), "./%s", filename);
>>
>> It would result in an incorrect path if the path does not fit in
>> addr.sun_path. It should report an explicit error instead.
> 
> Looking at its header file, 'sun_path' is indeed defined on macOS with an
> oddly small size of only 104 bytes. So yes, I should explicitly handle that
> error case.
> 
> I'll post a v3.
> 
>>> +    err = bind(fd, (struct sockaddr *) &addr, sizeof(addr));
>>> +    if (err == -1) {
>>> +        goto out;
>>
>> You may close(fd) as soon as bind() returns (before checking the
>> returned value) and eliminate goto.
> 
> Yeah, I thought about that alternative, but found it a bit ugly, and probably
> also counter-productive in case this function might get extended with more
> error pathes in future. Not that I would insist on the current solution
> though.

I'm happy with the explanation. Thanks.

> 
>>> +    }
>>> +    err = chmod(addr.sun_path, mode);
>>
>> I'm not sure if it is fine to have a time window between bind() and
>> chmod(). Do you have some rationale?
> 
> Good question. QEMU's 9p server is multi-threaded; all 9p requests come in
> serialized and the 9p server controller portion (9p.c) is only running on QEMU
> main thread, but the actual filesystem driver calls are then dispatched to
> QEMU worker threads and therefore running concurrently at this point:
> 
> https://wiki.qemu.org/Documentation/9p#Threads_and_Coroutines
> 
> Similar situation on Linux 9p client side: it handles access to a mounted 9p
> filesystem concurrently, requests are then serialized by 9p driver on Linux
> and sent over wire to 9p server (host).
> 
> So yes, there might be implications by that short time windows. But could that
> be exploited on macOS hosts in practice?
> 
> The socket file would have mode srwxr-xr-x for a short moment.
> 
> For security_model=mapped* this should not be a problem.
> 
> For security_model=none|passhrough, in theory, maybe? But how likely is that?
> If you are using a Linux client for instance, trying to brute-force opening
> the socket file, the client would send several 9p commands (Twalk, Tgetattr,
> Topen, probably more). The time window of the two commands above should be
> much smaller than that and I would expect one of the 9p commands to error out
> in between.
> 
> What would be a viable approach to avoid this issue on macOS?

It is unlikely that a naive brute-force approach will succeed to 
exploit. The more concerning scenario is that the attacker uses the 
knowledge of the underlying implementation of macOS to cause resource 
contention to widen the window. Whether an exploitation is viable 
depends on how much time you spend digging XNU.

However, I'm also not sure if it really *has* a race condition. Looking 
at v9fs_co_mknod(), it sequentially calls s->ops->mknod() and 
s->ops->lstat(). It also results in an entity called "path name based 
fid" in the code, which inherently cannot identify a file when it is 
renamed or recreated.

If there is some rationale it is safe, it may also be applied to the 
sequence of bind() and chmod(). Can anyone explain the sequence of 
s->ops->mknod() and s->ops->lstat() or path name based fid in general?

Regards,
Akihiko Odaki

> 
>>> +out:
>>> +    close(fd);
>>> +    return err;
>>> +}
>>> +
>>>
>>>    int qemu_mknodat(int dirfd, const char *filename, mode_t mode, dev_t
>>>    dev)
>>>    {
>>>    
>>>        int preserved_errno, err;
>>>
>>> @@ -93,7 +114,11 @@ int qemu_mknodat(int dirfd, const char *filename,
>>> mode_t mode, dev_t dev)>
>>>        if (pthread_fchdir_np(dirfd) < 0) {
>>>        
>>>            return -1;
>>>        
>>>        }
>>>
>>> -    err = mknod(filename, mode, dev);
>>> +    if (S_ISSOCK(mode)) {
>>> +        err = create_socket_file_at_cwd(filename, mode);
>>> +    } else {
>>> +        err = mknod(filename, mode, dev);
>>> +    }
>>>
>>>        preserved_errno = errno;
>>>        /* Stop using the thread-local cwd */
>>>        pthread_fchdir_np(-1);
> 
>

Christian Schoenebeck April 24, 2022, 6:45 p.m. UTC | #6

On Samstag, 23. April 2022 06:33:50 CEST Akihiko Odaki wrote:
> On 2022/04/22 23:06, Christian Schoenebeck wrote:
> > On Freitag, 22. April 2022 04:43:40 CEST Akihiko Odaki wrote:
> >> On 2022/04/22 0:07, Christian Schoenebeck wrote:
> >>> mknod() on macOS does not support creating sockets, so divert to
> >>> call sequence socket(), bind() and chmod() respectively if S_IFSOCK
> >>> was passed with mode argument.
> >>> 
> >>> Link: https://lore.kernel.org/qemu-devel/17933734.zYzKuhC07K@silver/
> >>> Signed-off-by: Christian Schoenebeck <qemu_oss@crudebyte.com>
> >>> Reviewed-by: Will Cohen <wwcohen@gmail.com>
> >>> ---
> >>> 
> >>>    hw/9pfs/9p-util-darwin.c | 27 ++++++++++++++++++++++++++-
> >>>    1 file changed, 26 insertions(+), 1 deletion(-)
> >>> 
> >>> diff --git a/hw/9pfs/9p-util-darwin.c b/hw/9pfs/9p-util-darwin.c
> >>> index e24d09763a..39308f2a45 100644
> >>> --- a/hw/9pfs/9p-util-darwin.c
> >>> +++ b/hw/9pfs/9p-util-darwin.c
> >>> @@ -74,6 +74,27 @@ int fsetxattrat_nofollow(int dirfd, const char
> >>> *filename, const char *name,>
> >>> 
> >>>     */
> >>>    
> >>>    #if defined CONFIG_PTHREAD_FCHDIR_NP
> >>> 
> >>> +static int create_socket_file_at_cwd(const char *filename, mode_t mode)
> >>> {
> >>> +    int fd, err;
> >>> +    struct sockaddr_un addr = {
> >>> +        .sun_family = AF_UNIX
> >>> +    };
> >>> +
> >>> +    fd = socket(PF_UNIX, SOCK_DGRAM, 0);
> >>> +    if (fd == -1) {
> >>> +        return fd;
> >>> +    }
> >>> +    snprintf(addr.sun_path, sizeof(addr.sun_path), "./%s", filename);
> >> 
> >> It would result in an incorrect path if the path does not fit in
> >> addr.sun_path. It should report an explicit error instead.
> > 
> > Looking at its header file, 'sun_path' is indeed defined on macOS with an
> > oddly small size of only 104 bytes. So yes, I should explicitly handle
> > that
> > error case.
> > 
> > I'll post a v3.
> > 
> >>> +    err = bind(fd, (struct sockaddr *) &addr, sizeof(addr));
> >>> +    if (err == -1) {
> >>> +        goto out;
> >> 
> >> You may close(fd) as soon as bind() returns (before checking the
> >> returned value) and eliminate goto.
> > 
> > Yeah, I thought about that alternative, but found it a bit ugly, and
> > probably also counter-productive in case this function might get extended
> > with more error pathes in future. Not that I would insist on the current
> > solution though.
> 
> I'm happy with the explanation. Thanks.
> 
> >>> +    }
> >>> +    err = chmod(addr.sun_path, mode);
> >> 
> >> I'm not sure if it is fine to have a time window between bind() and
> >> chmod(). Do you have some rationale?
> > 
> > Good question. QEMU's 9p server is multi-threaded; all 9p requests come in
> > serialized and the 9p server controller portion (9p.c) is only running on
> > QEMU main thread, but the actual filesystem driver calls are then
> > dispatched to QEMU worker threads and therefore running concurrently at
> > this point:
> > 
> > https://wiki.qemu.org/Documentation/9p#Threads_and_Coroutines
> > 
> > Similar situation on Linux 9p client side: it handles access to a mounted
> > 9p filesystem concurrently, requests are then serialized by 9p driver on
> > Linux and sent over wire to 9p server (host).
> > 
> > So yes, there might be implications by that short time windows. But could
> > that be exploited on macOS hosts in practice?
> > 
> > The socket file would have mode srwxr-xr-x for a short moment.
> > 
> > For security_model=mapped* this should not be a problem.
> > 
> > For security_model=none|passhrough, in theory, maybe? But how likely is
> > that? If you are using a Linux client for instance, trying to brute-force
> > opening the socket file, the client would send several 9p commands
> > (Twalk, Tgetattr, Topen, probably more). The time window of the two
> > commands above should be much smaller than that and I would expect one of
> > the 9p commands to error out in between.
> > 
> > What would be a viable approach to avoid this issue on macOS?
> 
> It is unlikely that a naive brute-force approach will succeed to
> exploit. The more concerning scenario is that the attacker uses the
> knowledge of the underlying implementation of macOS to cause resource
> contention to widen the window. Whether an exploitation is viable
> depends on how much time you spend digging XNU.
> 
> However, I'm also not sure if it really *has* a race condition. Looking
> at v9fs_co_mknod(), it sequentially calls s->ops->mknod() and
> s->ops->lstat(). It also results in an entity called "path name based
> fid" in the code, which inherently cannot identify a file when it is
> renamed or recreated.
> 
> If there is some rationale it is safe, it may also be applied to the
> sequence of bind() and chmod(). Can anyone explain the sequence of
> s->ops->mknod() and s->ops->lstat() or path name based fid in general?

You are talking about 9p server's controller level: I don't see something that 
would prevent a concurrent open() during this bind() ... chmod() time window 
unfortunately.

Argument 'fidp' passed to function v9fs_co_mknod() reflects the directory in 
which the new device file shall be created. So 'fidp' is not the device file 
here, nor is 'fidp' modified during this function.

Function v9fs_co_mknod() is entered by 9p server on QEMU main thread. At the 
beginning of the function it first acquires a read lock on a (per 9p export) 
global coroutine mutex:

    v9fs_path_read_lock(s);

and holds this lock until returning from function v9fs_co_mknod(). But that's 
just a read lock. Function v9fs_co_open() also just gains a read lock. So they 
can happen concurrently.

Then v9fs_co_run_in_worker({...}) is called to dispatch and execute all the 
code block (think of it as an Obj-C "block") inside this (macro actually) on a 
QEMU worker thread. So an arbitrary background thread would then call the fs 
driver functions:

    s->ops->mknod()
    v9fs_name_to_path()
    s->ops->lstat()

and then at the end of the code block the background thread would dispatch 
back to QEMU main thread. So when we are reaching:

    v9fs_path_unlock(s);

we are already back on QEMU main thread, hence unlocking on main thread now 
and finally leaving function v9fs_co_mknod().

The important thing to understand is, while that

    v9fs_co_run_in_worker({...})

code block is executed on a QEMU worker thread, the QEMU main thread (9p 
server controller portion, i.e. 9p.c) is *not* sleeping, QEMU main thread 
rather continues to process other (if any) client requests in the meantime. In 
other words v9fs_co_run_in_worker() neither behaves exactly like Apple's GCD 
dispatch_async(), nor like dispatch_sync(), as GCD is not coroutine based.

So 9p server might pull a pending 'Topen' client request from the input FIFO 
in the meantime and likewise dispatch that to a worker thread, etc. Hence a 
concurrent open() might in theory be possible, but I find it quite unlikely to 
succeed in practice as the open() call on guest is translated by Linux client 
into a bunch of synchronous 9p requests on the path passed with the open() 
call on guest, and a round trip for each 9p message is like what, ~0.3ms or 
something in this order. That's quite huge compared to the time window I would 
expect between bind() ... open().

Does this answer your questions?

> Regards,
> Akihiko Odaki
> 
> >>> +out:
> >>> +    close(fd);
> >>> +    return err;
> >>> +}
> >>> +
> >>> 
> >>>    int qemu_mknodat(int dirfd, const char *filename, mode_t mode, dev_t
> >>>    dev)
> >>>    {
> >>>    
> >>>        int preserved_errno, err;
> >>> 
> >>> @@ -93,7 +114,11 @@ int qemu_mknodat(int dirfd, const char *filename,
> >>> mode_t mode, dev_t dev)>
> >>> 
> >>>        if (pthread_fchdir_np(dirfd) < 0) {
> >>>        
> >>>            return -1;
> >>>        
> >>>        }
> >>> 
> >>> -    err = mknod(filename, mode, dev);
> >>> +    if (S_ISSOCK(mode)) {
> >>> +        err = create_socket_file_at_cwd(filename, mode);
> >>> +    } else {
> >>> +        err = mknod(filename, mode, dev);
> >>> +    }
> >>> 
> >>>        preserved_errno = errno;
> >>>        /* Stop using the thread-local cwd */
> >>>        pthread_fchdir_np(-1);

Akihiko Odaki April 26, 2022, 3:57 a.m. UTC | #7

On 2022/04/25 3:45, Christian Schoenebeck wrote:
>>>>> +    }
>>>>> +    err = chmod(addr.sun_path, mode);
>>>>
>>>> I'm not sure if it is fine to have a time window between bind() and
>>>> chmod(). Do you have some rationale?
>>>
>>> Good question. QEMU's 9p server is multi-threaded; all 9p requests come in
>>> serialized and the 9p server controller portion (9p.c) is only running on
>>> QEMU main thread, but the actual filesystem driver calls are then
>>> dispatched to QEMU worker threads and therefore running concurrently at
>>> this point:
>>>
>>> https://wiki.qemu.org/Documentation/9p#Threads_and_Coroutines
>>>
>>> Similar situation on Linux 9p client side: it handles access to a mounted
>>> 9p filesystem concurrently, requests are then serialized by 9p driver on
>>> Linux and sent over wire to 9p server (host).
>>>
>>> So yes, there might be implications by that short time windows. But could
>>> that be exploited on macOS hosts in practice?
>>>
>>> The socket file would have mode srwxr-xr-x for a short moment.
>>>
>>> For security_model=mapped* this should not be a problem.
>>>
>>> For security_model=none|passhrough, in theory, maybe? But how likely is
>>> that? If you are using a Linux client for instance, trying to brute-force
>>> opening the socket file, the client would send several 9p commands
>>> (Twalk, Tgetattr, Topen, probably more). The time window of the two
>>> commands above should be much smaller than that and I would expect one of
>>> the 9p commands to error out in between.
>>>
>>> What would be a viable approach to avoid this issue on macOS?
>>
>> It is unlikely that a naive brute-force approach will succeed to
>> exploit. The more concerning scenario is that the attacker uses the
>> knowledge of the underlying implementation of macOS to cause resource
>> contention to widen the window. Whether an exploitation is viable
>> depends on how much time you spend digging XNU.
>>
>> However, I'm also not sure if it really *has* a race condition. Looking
>> at v9fs_co_mknod(), it sequentially calls s->ops->mknod() and
>> s->ops->lstat(). It also results in an entity called "path name based
>> fid" in the code, which inherently cannot identify a file when it is
>> renamed or recreated.
>>
>> If there is some rationale it is safe, it may also be applied to the
>> sequence of bind() and chmod(). Can anyone explain the sequence of
>> s->ops->mknod() and s->ops->lstat() or path name based fid in general?
> 
> You are talking about 9p server's controller level: I don't see something that
> would prevent a concurrent open() during this bind() ... chmod() time window
> unfortunately.
> 
> Argument 'fidp' passed to function v9fs_co_mknod() reflects the directory in
> which the new device file shall be created. So 'fidp' is not the device file
> here, nor is 'fidp' modified during this function.
> 
> Function v9fs_co_mknod() is entered by 9p server on QEMU main thread. At the
> beginning of the function it first acquires a read lock on a (per 9p export)
> global coroutine mutex:
> 
>      v9fs_path_read_lock(s);
> 
> and holds this lock until returning from function v9fs_co_mknod(). But that's
> just a read lock. Function v9fs_co_open() also just gains a read lock. So they
> can happen concurrently.
> 
> Then v9fs_co_run_in_worker({...}) is called to dispatch and execute all the
> code block (think of it as an Obj-C "block") inside this (macro actually) on a
> QEMU worker thread. So an arbitrary background thread would then call the fs
> driver functions:
> 
>      s->ops->mknod()
>      v9fs_name_to_path()
>      s->ops->lstat()
> 
> and then at the end of the code block the background thread would dispatch
> back to QEMU main thread. So when we are reaching:
> 
>      v9fs_path_unlock(s);
> 
> we are already back on QEMU main thread, hence unlocking on main thread now
> and finally leaving function v9fs_co_mknod().
> 
> The important thing to understand is, while that
> 
>      v9fs_co_run_in_worker({...})
> 
> code block is executed on a QEMU worker thread, the QEMU main thread (9p
> server controller portion, i.e. 9p.c) is *not* sleeping, QEMU main thread
> rather continues to process other (if any) client requests in the meantime. In
> other words v9fs_co_run_in_worker() neither behaves exactly like Apple's GCD
> dispatch_async(), nor like dispatch_sync(), as GCD is not coroutine based.
> 
> So 9p server might pull a pending 'Topen' client request from the input FIFO
> in the meantime and likewise dispatch that to a worker thread, etc. Hence a
> concurrent open() might in theory be possible, but I find it quite unlikely to
> succeed in practice as the open() call on guest is translated by Linux client
> into a bunch of synchronous 9p requests on the path passed with the open()
> call on guest, and a round trip for each 9p message is like what, ~0.3ms or
> something in this order. That's quite huge compared to the time window I would
> expect between bind() ... open().
> 
> Does this answer your questions?

The time window may be widened by a malicious actor if the actor knows 
XNU well so the window length inferred from experiences is not really 
enough to claim it safe, particularly when considering about security.

On the other hand, I'm wondering if there is same kind of a time window 
between s->ops->mknodat() and s->ops->lstat(). Also, there should be 
similar time windows among operations with "path name based fid" as they 
also use path names as identifiers. If there is a rationale that it is 
considered secure, we may be able to apply the same logic to the time 
window between bind() and chmod() and claim it secure.
I need a review from someone who understands that part of the code, 
therefore.

Regards,
Akihiko Odaki

Greg Kurz April 26, 2022, 12:38 p.m. UTC | #8

On Tue, 26 Apr 2022 12:57:37 +0900
Akihiko Odaki <akihiko.odaki@gmail.com> wrote:

> On 2022/04/25 3:45, Christian Schoenebeck wrote:
> >>>>> +    }
> >>>>> +    err = chmod(addr.sun_path, mode);
> >>>>
> >>>> I'm not sure if it is fine to have a time window between bind() and
> >>>> chmod(). Do you have some rationale?
> >>>
> >>> Good question. QEMU's 9p server is multi-threaded; all 9p requests come in
> >>> serialized and the 9p server controller portion (9p.c) is only running on
> >>> QEMU main thread, but the actual filesystem driver calls are then
> >>> dispatched to QEMU worker threads and therefore running concurrently at
> >>> this point:
> >>>
> >>> https://wiki.qemu.org/Documentation/9p#Threads_and_Coroutines
> >>>
> >>> Similar situation on Linux 9p client side: it handles access to a mounted
> >>> 9p filesystem concurrently, requests are then serialized by 9p driver on
> >>> Linux and sent over wire to 9p server (host).
> >>>
> >>> So yes, there might be implications by that short time windows. But could
> >>> that be exploited on macOS hosts in practice?
> >>>
> >>> The socket file would have mode srwxr-xr-x for a short moment.
> >>>
> >>> For security_model=mapped* this should not be a problem.
> >>>
> >>> For security_model=none|passhrough, in theory, maybe? But how likely is
> >>> that? If you are using a Linux client for instance, trying to brute-force
> >>> opening the socket file, the client would send several 9p commands
> >>> (Twalk, Tgetattr, Topen, probably more). The time window of the two
> >>> commands above should be much smaller than that and I would expect one of
> >>> the 9p commands to error out in between.
> >>>
> >>> What would be a viable approach to avoid this issue on macOS?
> >>
> >> It is unlikely that a naive brute-force approach will succeed to
> >> exploit. The more concerning scenario is that the attacker uses the
> >> knowledge of the underlying implementation of macOS to cause resource
> >> contention to widen the window. Whether an exploitation is viable
> >> depends on how much time you spend digging XNU.
> >>
> >> However, I'm also not sure if it really *has* a race condition. Looking
> >> at v9fs_co_mknod(), it sequentially calls s->ops->mknod() and
> >> s->ops->lstat(). It also results in an entity called "path name based
> >> fid" in the code, which inherently cannot identify a file when it is
> >> renamed or recreated.
> >>
> >> If there is some rationale it is safe, it may also be applied to the
> >> sequence of bind() and chmod(). Can anyone explain the sequence of
> >> s->ops->mknod() and s->ops->lstat() or path name based fid in general?
> > 
> > You are talking about 9p server's controller level: I don't see something that
> > would prevent a concurrent open() during this bind() ... chmod() time window
> > unfortunately.
> > 
> > Argument 'fidp' passed to function v9fs_co_mknod() reflects the directory in
> > which the new device file shall be created. So 'fidp' is not the device file
> > here, nor is 'fidp' modified during this function.
> > 
> > Function v9fs_co_mknod() is entered by 9p server on QEMU main thread. At the
> > beginning of the function it first acquires a read lock on a (per 9p export)
> > global coroutine mutex:
> > 
> >      v9fs_path_read_lock(s);
> > 
> > and holds this lock until returning from function v9fs_co_mknod(). But that's
> > just a read lock. Function v9fs_co_open() also just gains a read lock. So they
> > can happen concurrently.
> > 
> > Then v9fs_co_run_in_worker({...}) is called to dispatch and execute all the
> > code block (think of it as an Obj-C "block") inside this (macro actually) on a
> > QEMU worker thread. So an arbitrary background thread would then call the fs
> > driver functions:
> > 
> >      s->ops->mknod()
> >      v9fs_name_to_path()
> >      s->ops->lstat()
> > 
> > and then at the end of the code block the background thread would dispatch
> > back to QEMU main thread. So when we are reaching:
> > 
> >      v9fs_path_unlock(s);
> > 
> > we are already back on QEMU main thread, hence unlocking on main thread now
> > and finally leaving function v9fs_co_mknod().
> > 
> > The important thing to understand is, while that
> > 
> >      v9fs_co_run_in_worker({...})
> > 
> > code block is executed on a QEMU worker thread, the QEMU main thread (9p
> > server controller portion, i.e. 9p.c) is *not* sleeping, QEMU main thread
> > rather continues to process other (if any) client requests in the meantime. In
> > other words v9fs_co_run_in_worker() neither behaves exactly like Apple's GCD
> > dispatch_async(), nor like dispatch_sync(), as GCD is not coroutine based.
> > 
> > So 9p server might pull a pending 'Topen' client request from the input FIFO
> > in the meantime and likewise dispatch that to a worker thread, etc. Hence a
> > concurrent open() might in theory be possible, but I find it quite unlikely to
> > succeed in practice as the open() call on guest is translated by Linux client
> > into a bunch of synchronous 9p requests on the path passed with the open()
> > call on guest, and a round trip for each 9p message is like what, ~0.3ms or
> > something in this order. That's quite huge compared to the time window I would
> > expect between bind() ... open().
> > 
> > Does this answer your questions?
> 
> The time window may be widened by a malicious actor if the actor knows 
> XNU well so the window length inferred from experiences is not really 
> enough to claim it safe, particularly when considering about security.
> 
> On the other hand, I'm wondering if there is same kind of a time window 
> between s->ops->mknodat() and s->ops->lstat(). Also, there should be 
> similar time windows among operations with "path name based fid" as they 
> also use path names as identifiers. If there is a rationale that it is 
> considered secure, we may be able to apply the same logic to the time 
> window between bind() and chmod() and claim it secure.
> I need a review from someone who understands that part of the code, 
> therefore.
> 

I think Christian's explanation is clear enough. We don't guarantee
that v9fs_co_foo() calls run atomically. As a consequence, the client
might see transient states or be able to interact with an ongoing
request. And to answer your question, we have no specific rationale
on security with that.

I'm not sure what the concerns are but unless you come up with a
valid scenario [*] I don't see any reason to prevent this patch
to go forward.

[*] things like:
    - client escaping the shared directory
    - QEMU crashing
    - QEMU hogging host resources
    - client-side unprivileged user gaining elevated privleges
      in the guest

Cheers,

--
Greg

> Regards,
> Akihiko Odaki

Akihiko Odaki April 27, 2022, 2:27 a.m. UTC | #9

On 2022/04/26 21:38, Greg Kurz wrote:
> On Tue, 26 Apr 2022 12:57:37 +0900
> Akihiko Odaki <akihiko.odaki@gmail.com> wrote:
> 
>> On 2022/04/25 3:45, Christian Schoenebeck wrote:
>>>>>>> +    }
>>>>>>> +    err = chmod(addr.sun_path, mode);
>>>>>>
>>>>>> I'm not sure if it is fine to have a time window between bind() and
>>>>>> chmod(). Do you have some rationale?
>>>>>
>>>>> Good question. QEMU's 9p server is multi-threaded; all 9p requests come in
>>>>> serialized and the 9p server controller portion (9p.c) is only running on
>>>>> QEMU main thread, but the actual filesystem driver calls are then
>>>>> dispatched to QEMU worker threads and therefore running concurrently at
>>>>> this point:
>>>>>
>>>>> https://wiki.qemu.org/Documentation/9p#Threads_and_Coroutines
>>>>>
>>>>> Similar situation on Linux 9p client side: it handles access to a mounted
>>>>> 9p filesystem concurrently, requests are then serialized by 9p driver on
>>>>> Linux and sent over wire to 9p server (host).
>>>>>
>>>>> So yes, there might be implications by that short time windows. But could
>>>>> that be exploited on macOS hosts in practice?
>>>>>
>>>>> The socket file would have mode srwxr-xr-x for a short moment.
>>>>>
>>>>> For security_model=mapped* this should not be a problem.
>>>>>
>>>>> For security_model=none|passhrough, in theory, maybe? But how likely is
>>>>> that? If you are using a Linux client for instance, trying to brute-force
>>>>> opening the socket file, the client would send several 9p commands
>>>>> (Twalk, Tgetattr, Topen, probably more). The time window of the two
>>>>> commands above should be much smaller than that and I would expect one of
>>>>> the 9p commands to error out in between.
>>>>>
>>>>> What would be a viable approach to avoid this issue on macOS?
>>>>
>>>> It is unlikely that a naive brute-force approach will succeed to
>>>> exploit. The more concerning scenario is that the attacker uses the
>>>> knowledge of the underlying implementation of macOS to cause resource
>>>> contention to widen the window. Whether an exploitation is viable
>>>> depends on how much time you spend digging XNU.
>>>>
>>>> However, I'm also not sure if it really *has* a race condition. Looking
>>>> at v9fs_co_mknod(), it sequentially calls s->ops->mknod() and
>>>> s->ops->lstat(). It also results in an entity called "path name based
>>>> fid" in the code, which inherently cannot identify a file when it is
>>>> renamed or recreated.
>>>>
>>>> If there is some rationale it is safe, it may also be applied to the
>>>> sequence of bind() and chmod(). Can anyone explain the sequence of
>>>> s->ops->mknod() and s->ops->lstat() or path name based fid in general?
>>>
>>> You are talking about 9p server's controller level: I don't see something that
>>> would prevent a concurrent open() during this bind() ... chmod() time window
>>> unfortunately.
>>>
>>> Argument 'fidp' passed to function v9fs_co_mknod() reflects the directory in
>>> which the new device file shall be created. So 'fidp' is not the device file
>>> here, nor is 'fidp' modified during this function.
>>>
>>> Function v9fs_co_mknod() is entered by 9p server on QEMU main thread. At the
>>> beginning of the function it first acquires a read lock on a (per 9p export)
>>> global coroutine mutex:
>>>
>>>       v9fs_path_read_lock(s);
>>>
>>> and holds this lock until returning from function v9fs_co_mknod(). But that's
>>> just a read lock. Function v9fs_co_open() also just gains a read lock. So they
>>> can happen concurrently.
>>>
>>> Then v9fs_co_run_in_worker({...}) is called to dispatch and execute all the
>>> code block (think of it as an Obj-C "block") inside this (macro actually) on a
>>> QEMU worker thread. So an arbitrary background thread would then call the fs
>>> driver functions:
>>>
>>>       s->ops->mknod()
>>>       v9fs_name_to_path()
>>>       s->ops->lstat()
>>>
>>> and then at the end of the code block the background thread would dispatch
>>> back to QEMU main thread. So when we are reaching:
>>>
>>>       v9fs_path_unlock(s);
>>>
>>> we are already back on QEMU main thread, hence unlocking on main thread now
>>> and finally leaving function v9fs_co_mknod().
>>>
>>> The important thing to understand is, while that
>>>
>>>       v9fs_co_run_in_worker({...})
>>>
>>> code block is executed on a QEMU worker thread, the QEMU main thread (9p
>>> server controller portion, i.e. 9p.c) is *not* sleeping, QEMU main thread
>>> rather continues to process other (if any) client requests in the meantime. In
>>> other words v9fs_co_run_in_worker() neither behaves exactly like Apple's GCD
>>> dispatch_async(), nor like dispatch_sync(), as GCD is not coroutine based.
>>>
>>> So 9p server might pull a pending 'Topen' client request from the input FIFO
>>> in the meantime and likewise dispatch that to a worker thread, etc. Hence a
>>> concurrent open() might in theory be possible, but I find it quite unlikely to
>>> succeed in practice as the open() call on guest is translated by Linux client
>>> into a bunch of synchronous 9p requests on the path passed with the open()
>>> call on guest, and a round trip for each 9p message is like what, ~0.3ms or
>>> something in this order. That's quite huge compared to the time window I would
>>> expect between bind() ... open().
>>>
>>> Does this answer your questions?
>>
>> The time window may be widened by a malicious actor if the actor knows
>> XNU well so the window length inferred from experiences is not really
>> enough to claim it safe, particularly when considering about security.
>>
>> On the other hand, I'm wondering if there is same kind of a time window
>> between s->ops->mknodat() and s->ops->lstat(). Also, there should be
>> similar time windows among operations with "path name based fid" as they
>> also use path names as identifiers. If there is a rationale that it is
>> considered secure, we may be able to apply the same logic to the time
>> window between bind() and chmod() and claim it secure.
>> I need a review from someone who understands that part of the code,
>> therefore.
>>
> 
> I think Christian's explanation is clear enough. We don't guarantee
> that v9fs_co_foo() calls run atomically. As a consequence, the client
> might see transient states or be able to interact with an ongoing
> request. And to answer your question, we have no specific rationale
> on security with that.
> 
> I'm not sure what the concerns are but unless you come up with a
> valid scenario [*] I don't see any reason to prevent this patch
> to go forward.
> 
> [*] things like:
>      - client escaping the shared directory
>      - QEMU crashing
>      - QEMU hogging host resources
>      - client-side unprivileged user gaining elevated privleges
>        in the guest

I was just not sure if such transient states are safe. The past 
discussion was about the length of the non-atomic time window where a 
path name is used to identify a particular file, but if such states are 
not considered problematic, the length does not matter all and we can 
confidently say the sequence of bind() and chmod() is safe.

Considering the transient states are tolerated in 9pfs, we need to 
design this function to be tolerant with transient states as well. The 
use of chmod() is not safe when we consider about transient states. A 
malicious actor may replace the file at the path with a symlink which 
may escape the shared directory and chmod() will naively follow it. 
chmod() should be replaced with fchmodat_nofollow() or something similar.

Regards,
Akihiko Odaki

> 
> Cheers,
> 
> --
> Greg
> 
>> Regards,
>> Akihiko Odaki
>

Greg Kurz April 27, 2022, 10:18 a.m. UTC | #10

On Wed, 27 Apr 2022 11:27:28 +0900
Akihiko Odaki <akihiko.odaki@gmail.com> wrote:

> On 2022/04/26 21:38, Greg Kurz wrote:

[..skip..]

> > 
> > I think Christian's explanation is clear enough. We don't guarantee
> > that v9fs_co_foo() calls run atomically. As a consequence, the client
> > might see transient states or be able to interact with an ongoing
> > request. And to answer your question, we have no specific rationale
> > on security with that.
> > 
> > I'm not sure what the concerns are but unless you come up with a
> > valid scenario [*] I don't see any reason to prevent this patch
> > to go forward.
> > 
> > [*] things like:
> >      - client escaping the shared directory
> >      - QEMU crashing
> >      - QEMU hogging host resources
> >      - client-side unprivileged user gaining elevated privleges
> >        in the guest
> 
> I was just not sure if such transient states are safe. The past 
> discussion was about the length of the non-atomic time window where a 
> path name is used to identify a particular file, but if such states are 
> not considered problematic, the length does not matter all and we can 
> confidently say the sequence of bind() and chmod() is safe.
> 
> Considering the transient states are tolerated in 9pfs, we need to 
> design this function to be tolerant with transient states as well. The 
> use of chmod() is not safe when we consider about transient states. A 
> malicious actor may replace the file at the path with a symlink which 
> may escape the shared directory and chmod() will naively follow it. 

You get a point here. Thanks for your tenacity ! :-)

> chmod() should be replaced with fchmodat_nofollow() or something similar.
> 

On a GNU/Linux system, this could be achieved by calling fchmod() on
the socket fd *before* calling bind() but I'm afraid this hack might
not work with a BSDish OS.

Replacing chmod() with fchmodat_nofollow(dirfd, addr.sun_path, mode)
won't make things atomic as above but at least it won't follow a
malicious symbolic link : mknod() on the client will fail with
ELOOP, which is fine when it comes to not breaking out of the shared
directory.

This brings up a new problem I hadn't realized before : the
fchmodat_nofollow() implementation in 9p-local.c is really
a linux only thing to cope with AT_SYMLINK_NOFOLLOW not being
supported with fchmodat(). It looks that this should move to
9p-util-linux.c and a proper version should be added for macOS
in 9p-util-darwin.c

Cheers,

--
Greg

> Regards,
> Akihiko Odaki
> 
> > 
> > Cheers,
> > 
> > --
> > Greg
> > 
> >> Regards,
> >> Akihiko Odaki
> > 
>

Christian Schoenebeck April 27, 2022, 12:32 p.m. UTC | #11

On Mittwoch, 27. April 2022 12:18:10 CEST Greg Kurz wrote:
> On Wed, 27 Apr 2022 11:27:28 +0900
> 
> Akihiko Odaki <akihiko.odaki@gmail.com> wrote:
> > On 2022/04/26 21:38, Greg Kurz wrote:
> [..skip..]
> 
> > > I think Christian's explanation is clear enough. We don't guarantee
> > > that v9fs_co_foo() calls run atomically. As a consequence, the client
> > > might see transient states or be able to interact with an ongoing
> > > request. And to answer your question, we have no specific rationale
> > > on security with that.
> > > 
> > > I'm not sure what the concerns are but unless you come up with a
> > > valid scenario [*] I don't see any reason to prevent this patch
> > > to go forward.
> > > 
> > > [*] things like:
> > >      - client escaping the shared directory
> > >      - QEMU crashing
> > >      - QEMU hogging host resources
> > >      - client-side unprivileged user gaining elevated privleges
> > >      
> > >        in the guest
> > 
> > I was just not sure if such transient states are safe. The past
> > discussion was about the length of the non-atomic time window where a
> > path name is used to identify a particular file, but if such states are
> > not considered problematic, the length does not matter all and we can
> > confidently say the sequence of bind() and chmod() is safe.
> > 
> > Considering the transient states are tolerated in 9pfs, we need to
> > design this function to be tolerant with transient states as well. The
> > use of chmod() is not safe when we consider about transient states. A
> > malicious actor may replace the file at the path with a symlink which
> > may escape the shared directory and chmod() will naively follow it.
> 
> You get a point here. Thanks for your tenacity ! :-)

Yep, I send a v4 with fchmodat_nofollow() instead of chmod(), thanks!

BTW, why is it actually allowed for client to create a symlink pointing 
outside exported directory tree with security_model=passthrough/none? Did 
anybody want that?

> > chmod() should be replaced with fchmodat_nofollow() or something similar.
> 
> On a GNU/Linux system, this could be achieved by calling fchmod() on
> the socket fd *before* calling bind() but I'm afraid this hack might
> not work with a BSDish OS.

As you already imagined, this is unfortunately not supported by any BSDs, 
including macOS. I'll file a bug report with Apple though.

> Replacing chmod() with fchmodat_nofollow(dirfd, addr.sun_path, mode)
> won't make things atomic as above but at least it won't follow a
> malicious symbolic link : mknod() on the client will fail with
> ELOOP, which is fine when it comes to not breaking out of the shared
> directory.

Current security_model=passthrough/none already has similar non-atomic 
operations BTW, so this was not something new. E.g.:

static int local_symlink(FsContext *fs_ctx, const char *oldpath,
                         V9fsPath *dir_path, const char *name, FsCred *credp)
{
    ...
    } else if (fs_ctx->export_flags & V9FS_SM_PASSTHROUGH ||
               fs_ctx->export_flags & V9FS_SM_NONE) {
        err = symlinkat(oldpath, dirfd, name);
        if (err) {
            goto out;
        }
        err = fchownat(dirfd, name, credp->fc_uid, credp->fc_gid,
                       AT_SYMLINK_NOFOLLOW);
    ...
}

In general, if you care about a higher degree of security, I'd always 
recommend to use security_model=mapped in the first place.

> This brings up a new problem I hadn't realized before : the
> fchmodat_nofollow() implementation in 9p-local.c is really
> a linux only thing to cope with AT_SYMLINK_NOFOLLOW not being
> supported with fchmodat(). It looks that this should move to
> 9p-util-linux.c and a proper version should be added for macOS
> in 9p-util-darwin.c

Like already agreed on the other thread, yes, that makes sense. But I think 
this can be handled with a follow-up, separate from this series.

Best regards,
Christian Schoenebeck

Greg Kurz April 27, 2022, 1:31 p.m. UTC | #12

On Wed, 27 Apr 2022 14:32:53 +0200
Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:

> On Mittwoch, 27. April 2022 12:18:10 CEST Greg Kurz wrote:
> > On Wed, 27 Apr 2022 11:27:28 +0900
> > 
> > Akihiko Odaki <akihiko.odaki@gmail.com> wrote:
> > > On 2022/04/26 21:38, Greg Kurz wrote:
> > [..skip..]
> > 
> > > > I think Christian's explanation is clear enough. We don't guarantee
> > > > that v9fs_co_foo() calls run atomically. As a consequence, the client
> > > > might see transient states or be able to interact with an ongoing
> > > > request. And to answer your question, we have no specific rationale
> > > > on security with that.
> > > > 
> > > > I'm not sure what the concerns are but unless you come up with a
> > > > valid scenario [*] I don't see any reason to prevent this patch
> > > > to go forward.
> > > > 
> > > > [*] things like:
> > > >      - client escaping the shared directory
> > > >      - QEMU crashing
> > > >      - QEMU hogging host resources
> > > >      - client-side unprivileged user gaining elevated privleges
> > > >      
> > > >        in the guest
> > > 
> > > I was just not sure if such transient states are safe. The past
> > > discussion was about the length of the non-atomic time window where a
> > > path name is used to identify a particular file, but if such states are
> > > not considered problematic, the length does not matter all and we can
> > > confidently say the sequence of bind() and chmod() is safe.
> > > 
> > > Considering the transient states are tolerated in 9pfs, we need to
> > > design this function to be tolerant with transient states as well. The
> > > use of chmod() is not safe when we consider about transient states. A
> > > malicious actor may replace the file at the path with a symlink which
> > > may escape the shared directory and chmod() will naively follow it.
> > 
> > You get a point here. Thanks for your tenacity ! :-)
> 
> Yep, I send a v4 with fchmodat_nofollow() instead of chmod(), thanks!
> 
> BTW, why is it actually allowed for client to create a symlink pointing 
> outside exported directory tree with security_model=passthrough/none? Did 
> anybody want that?
> 

The target argument to symlink() is merely a string that is stored in
the inode. It is only evaluated as a path at the time an action is
made on the link. Checking at symlink() time is thus useless.

Anyway, we're safe on this side since it's the client's job to
resolve links and we explicitly don't follow them in the server.

> > > chmod() should be replaced with fchmodat_nofollow() or something similar.
> > 
> > On a GNU/Linux system, this could be achieved by calling fchmod() on
> > the socket fd *before* calling bind() but I'm afraid this hack might
> > not work with a BSDish OS.
> 
> As you already imagined, this is unfortunately not supported by any BSDs, 
> including macOS. I'll file a bug report with Apple though.
> 

I'm not sure if this is documented and supported behavior on linux either.

> > Replacing chmod() with fchmodat_nofollow(dirfd, addr.sun_path, mode)
> > won't make things atomic as above but at least it won't follow a
> > malicious symbolic link : mknod() on the client will fail with
> > ELOOP, which is fine when it comes to not breaking out of the shared
> > directory.
> 
> Current security_model=passthrough/none already has similar non-atomic 
> operations BTW, so this was not something new. E.g.:
> 
> static int local_symlink(FsContext *fs_ctx, const char *oldpath,
>                          V9fsPath *dir_path, const char *name, FsCred *credp)
> {
>     ...
>     } else if (fs_ctx->export_flags & V9FS_SM_PASSTHROUGH ||
>                fs_ctx->export_flags & V9FS_SM_NONE) {
>         err = symlinkat(oldpath, dirfd, name);
>         if (err) {
>             goto out;
>         }
>         err = fchownat(dirfd, name, credp->fc_uid, credp->fc_gid,
>                        AT_SYMLINK_NOFOLLOW);
>     ...
> }
> 

Yes similar window but this is secure since fchownat() supports
AT_SYMLINK_NOFOLLOW.

> In general, if you care about a higher degree of security, I'd always 
> recommend to use security_model=mapped in the first place.
> 
> > This brings up a new problem I hadn't realized before : the
> > fchmodat_nofollow() implementation in 9p-local.c is really
> > a linux only thing to cope with AT_SYMLINK_NOFOLLOW not being
> > supported with fchmodat(). It looks that this should move to
> > 9p-util-linux.c and a proper version should be added for macOS
> > in 9p-util-darwin.c
> 
> Like already agreed on the other thread, yes, that makes sense. But I think 
> this can be handled with a follow-up, separate from this series.
> 

Fair enough if you want to handle fchmodat_nofollow() later but you
must at least use fchmodat(AT_SYMLINK_NOFOLLOW) in this patch
instead of chmod().

> Best regards,
> Christian Schoenebeck
> 
>

Christian Schoenebeck April 27, 2022, 4:18 p.m. UTC | #13

On Mittwoch, 27. April 2022 15:31:42 CEST Greg Kurz wrote:
> On Wed, 27 Apr 2022 14:32:53 +0200
> 
> Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:
> > On Mittwoch, 27. April 2022 12:18:10 CEST Greg Kurz wrote:
> > > On Wed, 27 Apr 2022 11:27:28 +0900
> > > 
> > > Akihiko Odaki <akihiko.odaki@gmail.com> wrote:
> > > > On 2022/04/26 21:38, Greg Kurz wrote:
[...]
> > > > Considering the transient states are tolerated in 9pfs, we need to
> > > > design this function to be tolerant with transient states as well. The
> > > > use of chmod() is not safe when we consider about transient states. A
> > > > malicious actor may replace the file at the path with a symlink which
> > > > may escape the shared directory and chmod() will naively follow it.
> > > 
> > > You get a point here. Thanks for your tenacity ! :-)
> > 
> > Yep, I send a v4 with fchmodat_nofollow() instead of chmod(), thanks!
> > 
> > BTW, why is it actually allowed for client to create a symlink pointing
> > outside exported directory tree with security_model=passthrough/none? Did
> > anybody want that?
> 
> The target argument to symlink() is merely a string that is stored in
> the inode. It is only evaluated as a path at the time an action is
> made on the link. Checking at symlink() time is thus useless.
> 
> Anyway, we're safe on this side since it's the client's job to
> resolve links and we explicitly don't follow them in the server.

I wouldn't call it useless, because it is easier to keep exactly 1 hole
stuffed instead of being forced to continuously stuff N holes as new ones may
popup up over time, as this case shows (mea culpa).

So you are trading (probably minor) performance for security.

But my question was actually meant seriously: whether there was anybody in the
past who probably actually wanted to create relative symlinks outside the
exported tree. For instance for another service with wider host filesystem
access.

[...]
> > > This brings up a new problem I hadn't realized before : the
> > > fchmodat_nofollow() implementation in 9p-local.c is really
> > > a linux only thing to cope with AT_SYMLINK_NOFOLLOW not being
> > > supported with fchmodat(). It looks that this should move to
> > > 9p-util-linux.c and a proper version should be added for macOS
> > > in 9p-util-darwin.c
> > 
> > Like already agreed on the other thread, yes, that makes sense. But I
> > think
> > this can be handled with a follow-up, separate from this series.
> 
> Fair enough if you want to handle fchmodat_nofollow() later but you
> must at least use fchmodat(AT_SYMLINK_NOFOLLOW) in this patch
> instead of chmod().

Sounds like a quick and easy workaround. However looking at 'man fchmodat' on
macOS, this probably does not exactly do what you wanted it to, at least not
in this particular case:

     AT_SYMLINK_NOFOLLOW
             If path names a symbolic link, then the mode of the symbolic link is changed.

     AT_SYMLINK_NOFOLLOW_ANY
             If path names a symbolic link, then the mode of the symbolic link is changed and
             if if the path has any other symbolic links, an error is returned.

So if somebody replaced the socket file by a 1st order symlink, you would
adjust the symlink's permissions with both AT_SYMLINK_NOFOLLOW as well as with 
AT_SYMLINK_NOFOLLOW_ANY. I mean it's better than chmod(), but acceptable?

Using our existing fchmodat_nofollow() instead is no viable alternative
either, as it uses operations that are not supported on socket files on macOS
(tested).

Best regards,
Christian Schoenebeck

Will Cohen April 27, 2022, 5:12 p.m. UTC | #14

On Wed, Apr 27, 2022 at 12:18 PM Christian Schoenebeck <
qemu_oss@crudebyte.com> wrote:

> On Mittwoch, 27. April 2022 15:31:42 CEST Greg Kurz wrote:
> > On Wed, 27 Apr 2022 14:32:53 +0200
> >
> > Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:
> > > On Mittwoch, 27. April 2022 12:18:10 CEST Greg Kurz wrote:
> > > > On Wed, 27 Apr 2022 11:27:28 +0900
> > > >
> > > > Akihiko Odaki <akihiko.odaki@gmail.com> wrote:
> > > > > On 2022/04/26 21:38, Greg Kurz wrote:
> [...]
> > > > > Considering the transient states are tolerated in 9pfs, we need to
> > > > > design this function to be tolerant with transient states as well.
> The
> > > > > use of chmod() is not safe when we consider about transient
> states. A
> > > > > malicious actor may replace the file at the path with a symlink
> which
> > > > > may escape the shared directory and chmod() will naively follow it.
> > > >
> > > > You get a point here. Thanks for your tenacity ! :-)
> > >
> > > Yep, I send a v4 with fchmodat_nofollow() instead of chmod(), thanks!
> > >
> > > BTW, why is it actually allowed for client to create a symlink pointing
> > > outside exported directory tree with security_model=passthrough/none?
> Did
> > > anybody want that?
> >
> > The target argument to symlink() is merely a string that is stored in
> > the inode. It is only evaluated as a path at the time an action is
> > made on the link. Checking at symlink() time is thus useless.
> >
> > Anyway, we're safe on this side since it's the client's job to
> > resolve links and we explicitly don't follow them in the server.
>
> I wouldn't call it useless, because it is easier to keep exactly 1 hole
> stuffed instead of being forced to continuously stuff N holes as new ones
> may
> popup up over time, as this case shows (mea culpa).
>
> So you are trading (probably minor) performance for security.
>
> But my question was actually meant seriously: whether there was anybody in
> the
> past who probably actually wanted to create relative symlinks outside the
> exported tree. For instance for another service with wider host filesystem
> access.
>

For what it's worth, the inability to follow symlinks read-write outside of
the tree appears to be, at the moment, the primary pain point for new users
of 9pfs in containerization software (see the later comments in
https://github.com/lima-vm/lima/pull/726 and to a lesser extent
https://github.com/containers/podman/issues/13784).

To my knowledge, neither podman nor lima have come up with conclusive
preferred solutions for how to address this, much less had a robust
discussion with the QEMU team.
The lima discussion notes that it works read-only with passthrough/none, so
I'd suggest that if there weren't users of it before, there are now! As I
understand it, one partial solution for downstream software that allows for
read-write may just be to more proactively mount larger directories to
minimize the number of external paths that symlinks might get tripped up
on. That said, this will stop working when it comes to linking to
additional mounts, since /Volumes on darwin will never line up with /mnt.


> [...]
> > > > This brings up a new problem I hadn't realized before : the
> > > > fchmodat_nofollow() implementation in 9p-local.c is really
> > > > a linux only thing to cope with AT_SYMLINK_NOFOLLOW not being
> > > > supported with fchmodat(). It looks that this should move to
> > > > 9p-util-linux.c and a proper version should be added for macOS
> > > > in 9p-util-darwin.c
> > >
> > > Like already agreed on the other thread, yes, that makes sense. But I
> > > think
> > > this can be handled with a follow-up, separate from this series.
> >
> > Fair enough if you want to handle fchmodat_nofollow() later but you
> > must at least use fchmodat(AT_SYMLINK_NOFOLLOW) in this patch
> > instead of chmod().
>
> Sounds like a quick and easy workaround. However looking at 'man fchmodat'
> on
> macOS, this probably does not exactly do what you wanted it to, at least
> not
> in this particular case:
>
>      AT_SYMLINK_NOFOLLOW
>              If path names a symbolic link, then the mode of the symbolic
> link is changed.
>
>      AT_SYMLINK_NOFOLLOW_ANY
>              If path names a symbolic link, then the mode of the symbolic
> link is changed and
>              if if the path has any other symbolic links, an error is
> returned.
>
> So if somebody replaced the socket file by a 1st order symlink, you would
> adjust the symlink's permissions with both AT_SYMLINK_NOFOLLOW as well as
> with
> AT_SYMLINK_NOFOLLOW_ANY. I mean it's better than chmod(), but acceptable?
>
> Using our existing fchmodat_nofollow() instead is no viable alternative
> either, as it uses operations that are not supported on socket files on
> macOS
> (tested).
>
> Best regards,
> Christian Schoenebeck
>
>
>

Greg Kurz April 27, 2022, 5:37 p.m. UTC | #15

On Wed, 27 Apr 2022 18:18:31 +0200
Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:

> On Mittwoch, 27. April 2022 15:31:42 CEST Greg Kurz wrote:
> > On Wed, 27 Apr 2022 14:32:53 +0200
> > 
> > Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:
> > > On Mittwoch, 27. April 2022 12:18:10 CEST Greg Kurz wrote:
> > > > On Wed, 27 Apr 2022 11:27:28 +0900
> > > > 
> > > > Akihiko Odaki <akihiko.odaki@gmail.com> wrote:
> > > > > On 2022/04/26 21:38, Greg Kurz wrote:
> [...]
> > > > > Considering the transient states are tolerated in 9pfs, we need to
> > > > > design this function to be tolerant with transient states as well. The
> > > > > use of chmod() is not safe when we consider about transient states. A
> > > > > malicious actor may replace the file at the path with a symlink which
> > > > > may escape the shared directory and chmod() will naively follow it.
> > > > 
> > > > You get a point here. Thanks for your tenacity ! :-)
> > > 
> > > Yep, I send a v4 with fchmodat_nofollow() instead of chmod(), thanks!
> > > 
> > > BTW, why is it actually allowed for client to create a symlink pointing
> > > outside exported directory tree with security_model=passthrough/none? Did
> > > anybody want that?
> > 
> > The target argument to symlink() is merely a string that is stored in
> > the inode. It is only evaluated as a path at the time an action is
> > made on the link. Checking at symlink() time is thus useless.
> > 
> > Anyway, we're safe on this side since it's the client's job to
> > resolve links and we explicitly don't follow them in the server.
> 
> I wouldn't call it useless, because it is easier to keep exactly 1 hole
> stuffed instead of being forced to continuously stuff N holes as new ones may
> popup up over time, as this case shows (mea culpa).
> 
> So you are trading (probably minor) performance for security.
> 
> But my question was actually meant seriously: whether there was anybody in the
> past who probably actually wanted to create relative symlinks outside the
> exported tree. For instance for another service with wider host filesystem
> access.
> 

I took the question seriously :-), the problem is that even if you
do a check on the target at symlink() time, you don't know how it
will be evaluated in the end.

Quick demonstration:

$ cd /var/tmp/
$ mkdir foo
$ cd foo/
$ # Suppose foo is the jail
$ mkdir bar
$ ln -sf .. bar/link
$ realpath bar/link
/var/tmp/foo
$ # Good, we're still under foo
$ mv bar/link .
$ realpath link
/var/tmp
$ # Ouch we've escaped

So in the end, the only real fix is to ban path based syscalls and
pass AT_SYMLINK_NOFOLLOW everywhere. This was the justification for
rewriting nearly all 9p local in order to fix CVE-2016-9602.

https://lists.gnu.org/archive/html/qemu-devel/2017-01/msg06225.html

> [...]
> > > > This brings up a new problem I hadn't realized before : the
> > > > fchmodat_nofollow() implementation in 9p-local.c is really
> > > > a linux only thing to cope with AT_SYMLINK_NOFOLLOW not being
> > > > supported with fchmodat(). It looks that this should move to
> > > > 9p-util-linux.c and a proper version should be added for macOS
> > > > in 9p-util-darwin.c
> > > 
> > > Like already agreed on the other thread, yes, that makes sense. But I
> > > think
> > > this can be handled with a follow-up, separate from this series.
> > 
> > Fair enough if you want to handle fchmodat_nofollow() later but you
> > must at least use fchmodat(AT_SYMLINK_NOFOLLOW) in this patch
> > instead of chmod().
> 
> Sounds like a quick and easy workaround. However looking at 'man fchmodat' on
> macOS, this probably does not exactly do what you wanted it to, at least not
> in this particular case:
> 
>      AT_SYMLINK_NOFOLLOW
>              If path names a symbolic link, then the mode of the symbolic link is changed.
> 
>      AT_SYMLINK_NOFOLLOW_ANY
>              If path names a symbolic link, then the mode of the symbolic link is changed and
>              if if the path has any other symbolic links, an error is returned.
> 
> So if somebody replaced the socket file by a 1st order symlink, you would
> adjust the symlink's permissions with both AT_SYMLINK_NOFOLLOW as well as with 
> AT_SYMLINK_NOFOLLOW_ANY. I mean it's better than chmod(), but acceptable?
> 

As long as the link is not followed outside, we're good : it will change the
symlink mode and then what ?

> Using our existing fchmodat_nofollow() instead is no viable alternative
> either, as it uses operations that are not supported on socket files on macOS
> (tested).
> 
> Best regards,
> Christian Schoenebeck
> 
>

Christian Schoenebeck April 27, 2022, 6:16 p.m. UTC | #16

On Mittwoch, 27. April 2022 19:12:15 CEST Will Cohen wrote:
> On Wed, Apr 27, 2022 at 12:18 PM Christian Schoenebeck <
> 
> qemu_oss@crudebyte.com> wrote:
> > On Mittwoch, 27. April 2022 15:31:42 CEST Greg Kurz wrote:
> > > On Wed, 27 Apr 2022 14:32:53 +0200
> > > 
> > > Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:
> > > > On Mittwoch, 27. April 2022 12:18:10 CEST Greg Kurz wrote:
> > > > > On Wed, 27 Apr 2022 11:27:28 +0900
> > > > > 
> > > > > Akihiko Odaki <akihiko.odaki@gmail.com> wrote:
> > > > > > On 2022/04/26 21:38, Greg Kurz wrote:
> > [...]
> > 
> > > > > > Considering the transient states are tolerated in 9pfs, we need to
> > > > > > design this function to be tolerant with transient states as well.
> > 
> > The
> > 
> > > > > > use of chmod() is not safe when we consider about transient
> > 
> > states. A
> > 
> > > > > > malicious actor may replace the file at the path with a symlink
> > 
> > which
> > 
> > > > > > may escape the shared directory and chmod() will naively follow
> > > > > > it.
> > > > > 
> > > > > You get a point here. Thanks for your tenacity ! :-)
> > > > 
> > > > Yep, I send a v4 with fchmodat_nofollow() instead of chmod(), thanks!
> > > > 
> > > > BTW, why is it actually allowed for client to create a symlink
> > > > pointing
> > > > outside exported directory tree with security_model=passthrough/none?
> > 
> > Did
> > 
> > > > anybody want that?
> > > 
> > > The target argument to symlink() is merely a string that is stored in
> > > the inode. It is only evaluated as a path at the time an action is
> > > made on the link. Checking at symlink() time is thus useless.
> > > 
> > > Anyway, we're safe on this side since it's the client's job to
> > > resolve links and we explicitly don't follow them in the server.
> > 
> > I wouldn't call it useless, because it is easier to keep exactly 1 hole
> > stuffed instead of being forced to continuously stuff N holes as new ones
> > may
> > popup up over time, as this case shows (mea culpa).
> > 
> > So you are trading (probably minor) performance for security.
> > 
> > But my question was actually meant seriously: whether there was anybody in
> > the
> > past who probably actually wanted to create relative symlinks outside the
> > exported tree. For instance for another service with wider host filesystem
> > access.
> 
> For what it's worth, the inability to follow symlinks read-write outside of
> the tree appears to be, at the moment, the primary pain point for new users
> of 9pfs in containerization software (see the later comments in
> https://github.com/lima-vm/lima/pull/726 and to a lesser extent
> https://github.com/containers/podman/issues/13784).
> 
> To my knowledge, neither podman nor lima have come up with conclusive
> preferred solutions for how to address this, much less had a robust
> discussion with the QEMU team.
> The lima discussion notes that it works read-only with passthrough/none, so
> I'd suggest that if there weren't users of it before, there are now! As I
> understand it, one partial solution for downstream software that allows for
> read-write may just be to more proactively mount larger directories to
> minimize the number of external paths that symlinks might get tripped up
> on. That said, this will stop working when it comes to linking to
> additional mounts, since /Volumes on darwin will never line up with /mnt.

That's a different thing. People in those discussions were using 
security_model=mapped where symlinks on guest are virtually mapped as textual 
file content (try 'cat <symlink>' on host). So in this mode symlinks on host 
and symlinks on guest are intentionally separated from each other.

The issue I was referring to was about security_model=passthrough|none which 
has the exact same symlinks accessible both on host and guest side, and more 
specifically I meant: symlinks created by guest that would point to a location 
*above* the 9p export root. E.g. say guest has access to the following host 
directory via 9p, that is access *below* the following directory on host:

  /vm/foo/

And say guest now mounts that host directory and creates a symlink like:

  mount -t 9p foo /mnt
  cd /mnt
  ln -s ../bar bar

That symlink would now point to /bar from guest's PoV, and to /vm/bar from 
host's PoV (i.e. a location on host where guest should not have access to at 
all).

BTW some of the other issues mentioned in the linked discussion, like the 
timeout errors, are fixed with this patch set.

Best regards,
Christian Schoenebeck

Christian Schoenebeck April 27, 2022, 6:36 p.m. UTC | #17

On Mittwoch, 27. April 2022 19:37:39 CEST Greg Kurz wrote:
> On Wed, 27 Apr 2022 18:18:31 +0200
> 
> Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:
> > On Mittwoch, 27. April 2022 15:31:42 CEST Greg Kurz wrote:
> > > On Wed, 27 Apr 2022 14:32:53 +0200
> > > 
> > > Christian Schoenebeck <qemu_oss@crudebyte.com> wrote:
> > > > On Mittwoch, 27. April 2022 12:18:10 CEST Greg Kurz wrote:
> > > > > On Wed, 27 Apr 2022 11:27:28 +0900
> > > > > 
> > > > > Akihiko Odaki <akihiko.odaki@gmail.com> wrote:
> > > > > > On 2022/04/26 21:38, Greg Kurz wrote:
> > [...]
> > 
> > > > > > Considering the transient states are tolerated in 9pfs, we need to
> > > > > > design this function to be tolerant with transient states as well.
> > > > > > The
> > > > > > use of chmod() is not safe when we consider about transient
> > > > > > states. A
> > > > > > malicious actor may replace the file at the path with a symlink
> > > > > > which
> > > > > > may escape the shared directory and chmod() will naively follow
> > > > > > it.
> > > > > 
> > > > > You get a point here. Thanks for your tenacity ! :-)
> > > > 
> > > > Yep, I send a v4 with fchmodat_nofollow() instead of chmod(), thanks!
> > > > 
> > > > BTW, why is it actually allowed for client to create a symlink
> > > > pointing
> > > > outside exported directory tree with security_model=passthrough/none?
> > > > Did
> > > > anybody want that?
> > > 
> > > The target argument to symlink() is merely a string that is stored in
> > > the inode. It is only evaluated as a path at the time an action is
> > > made on the link. Checking at symlink() time is thus useless.
> > > 
> > > Anyway, we're safe on this side since it's the client's job to
> > > resolve links and we explicitly don't follow them in the server.
> > 
> > I wouldn't call it useless, because it is easier to keep exactly 1 hole
> > stuffed instead of being forced to continuously stuff N holes as new ones
> > may popup up over time, as this case shows (mea culpa).
> > 
> > So you are trading (probably minor) performance for security.
> > 
> > But my question was actually meant seriously: whether there was anybody in
> > the past who probably actually wanted to create relative symlinks outside
> > the exported tree. For instance for another service with wider host
> > filesystem access.
> 
> I took the question seriously :-), the problem is that even if you
> do a check on the target at symlink() time, you don't know how it
> will be evaluated in the end.
> 
> Quick demonstration:
> 
> $ cd /var/tmp/
> $ mkdir foo
> $ cd foo/
> $ # Suppose foo is the jail
> $ mkdir bar
> $ ln -sf .. bar/link
> $ realpath bar/link
> /var/tmp/foo
> $ # Good, we're still under foo
> $ mv bar/link .
> $ realpath link
> /var/tmp
> $ # Ouch we've escaped
> 
> So in the end, the only real fix is to ban path based syscalls and
> pass AT_SYMLINK_NOFOLLOW everywhere. This was the justification for
> rewriting nearly all 9p local in order to fix CVE-2016-9602.
> 
> https://lists.gnu.org/archive/html/qemu-devel/2017-01/msg06225.html

Touché :) Agreed, it's not worth it.

I mean this simple example could still be addressed by catching the move, but 
if you have like several nested directories, each with a huge number of 
chained symlinks, on top of it non-atomic issues etc., then things would get 
way expensive to check, as you would actually have to traverse an entire tree 
and validate an even bigger amount of symlink pathes for every single symlink 
modification attempt on guest, probably even with exclusive locks, and so on.

> > [...]
> > 
> > > > > This brings up a new problem I hadn't realized before : the
> > > > > fchmodat_nofollow() implementation in 9p-local.c is really
> > > > > a linux only thing to cope with AT_SYMLINK_NOFOLLOW not being
> > > > > supported with fchmodat(). It looks that this should move to
> > > > > 9p-util-linux.c and a proper version should be added for macOS
> > > > > in 9p-util-darwin.c
> > > > 
> > > > Like already agreed on the other thread, yes, that makes sense. But I
> > > > think
> > > > this can be handled with a follow-up, separate from this series.
> > > 
> > > Fair enough if you want to handle fchmodat_nofollow() later but you
> > > must at least use fchmodat(AT_SYMLINK_NOFOLLOW) in this patch
> > > instead of chmod().
> > 
> > Sounds like a quick and easy workaround. However looking at 'man fchmodat'
> > on macOS, this probably does not exactly do what you wanted it to, at
> > least not> 
> > in this particular case:
> >      AT_SYMLINK_NOFOLLOW
> >      
> >              If path names a symbolic link, then the mode of the symbolic
> >              link is changed.>      
> >      AT_SYMLINK_NOFOLLOW_ANY
> >      
> >              If path names a symbolic link, then the mode of the symbolic
> >              link is changed and if if the path has any other symbolic
> >              links, an error is returned.> 
> > So if somebody replaced the socket file by a 1st order symlink, you would
> > adjust the symlink's permissions with both AT_SYMLINK_NOFOLLOW as well as
> > with AT_SYMLINK_NOFOLLOW_ANY. I mean it's better than chmod(), but
> > acceptable?
> As long as the link is not followed outside, we're good : it will change the
> symlink mode and then what ?

OK, then fine with me. I already filed a bug report with Apple for supporting 
fchmod(socket()). Hopefully we'll have a clean atomic solution in near future 
for this issue then.

I'll post v4 now. Thanks!

Best regards,
Christian Schoenebeck

[v2,2/5] 9pfs: fix qemu_mknodat(S_IFSOCK) on macOS

Commit Message

Comments

Patch