mbox series

[0/3] namei: implement various scoping AT_* flags

Message ID 20180929103453.12025-1-cyphar@cyphar.com (mailing list archive)
Headers show
Series namei: implement various scoping AT_* flags | expand

Message

Aleksa Sarai Sept. 29, 2018, 10:34 a.m. UTC
The need for some sort of control over VFS's path resolution (to avoid
malicious paths resulting in inadvertent breakouts) has been a very
long-standing desire of many userspace applications. This patchset is a
revival of Al Viro's old AT_NO_JUMPS[1] patchset with a few additions.

The most obvious change is that AT_NO_JUMPS has been split as dicussed
in the original thread, along with a further split of AT_NO_PROCLINKS
which means that each individual property of AT_NO_JUMPS is now a
separate flag:

  * Path-based escapes from the starting-point using "/" or ".." are
    blocked by AT_BENEATH.
  * Mountpoint crossings are blocked by AT_XDEV.
  * /proc/$pid/fd/$fd resolution is blocked by AT_NO_PROCLINKS (more
	correctly it actually blocks any user of nd_jump_link() because it
	allows out-of-VFS path resolution manipulation).

AT_NO_JUMPS is now effectively (AT_BENEATH|AT_XDEV|AT_NO_PROCLINKS). At
Linus' suggestion in the original thread, I've also implemented
AT_NO_SYMLINKS which just denies _all_ symlink resolution (including
"proclink" resolution).

An additional improvement was made to AT_XDEV. The original AT_NO_JUMPS
path didn't consider "/tmp/.." as a mountpoint crossing -- this patch
blocks this as well (feel free to ask me to remove it if you feel this
is not sane).

Currently I've only enabled these for openat(2) and the stat(2) family.
I would hope we could enable it for basically every *at(2) syscall --
but many of them appear to not have a @flags argument and thus we'll
need to add several new syscalls to do this. I'm more than happy to send
those patches, but I'd prefer to know that this preliminary work is
acceptable before doing a bunch of copy-paste to add new sets of *at(2)
syscalls.

One additional feature I've implemented is AT_THIS_ROOT (I imagine this
is probably going to be more contentious than the refresh of
AT_NO_JUMPS, so I've included it in a separate patch). The patch itself
describes my reasoning, but the shortened version of the premise is that
continer runtimes need to have a way to resolve paths within a
potentially malicious rootfs. Container runtimes currently do this in
userspace[2] which has implicit race conditions that are not resolvable
in userspace (or use fork+exec+chroot and SCM_RIGHTS passing which is
inefficient). AT_THIS_ROOT allows for per-call chroot-like semantics for
path resolution, which would be invaluable for us -- and the
implementation is basically identical to AT_BENEATH (except that we
don't return errors when someone actually hits the root).

I've added some selftests for this, but it's not clear to me whether
they should live here or in xfstests (as far as I can tell there are no
other VFS tests in selftests, while there are some tests that look like
generic VFS tests in xfstests). If you'd prefer them to be included in
xfstests, let me know.

[1]: https://lore.kernel.org/patchwork/patch/784221/
[2]: https://github.com/cyphar/filepath-securejoin

Aleksa Sarai (3):
  namei: implement O_BENEATH-style AT_* flags
  namei: implement AT_THIS_ROOT chroot-like path resolution
  selftests: vfs: add AT_* path resolution tests

 fs/fcntl.c                                    |   2 +-
 fs/namei.c                                    | 158 ++++++++++++------
 fs/open.c                                     |  10 ++
 fs/stat.c                                     |  15 +-
 include/linux/fcntl.h                         |   3 +-
 include/linux/namei.h                         |   8 +
 include/uapi/asm-generic/fcntl.h              |  20 +++
 include/uapi/linux/fcntl.h                    |  10 ++
 tools/testing/selftests/Makefile              |   1 +
 tools/testing/selftests/vfs/.gitignore        |   1 +
 tools/testing/selftests/vfs/Makefile          |  13 ++
 tools/testing/selftests/vfs/at_flags.h        |  40 +++++
 tools/testing/selftests/vfs/common.sh         |  37 ++++
 .../selftests/vfs/tests/0001_at_beneath.sh    |  72 ++++++++
 .../selftests/vfs/tests/0002_at_xdev.sh       |  54 ++++++
 .../vfs/tests/0003_at_no_proclinks.sh         |  50 ++++++
 .../vfs/tests/0004_at_no_symlinks.sh          |  49 ++++++
 .../selftests/vfs/tests/0005_at_this_root.sh  |  66 ++++++++
 tools/testing/selftests/vfs/vfs_helper.c      | 154 +++++++++++++++++
 19 files changed, 707 insertions(+), 56 deletions(-)
 create mode 100644 tools/testing/selftests/vfs/.gitignore
 create mode 100644 tools/testing/selftests/vfs/Makefile
 create mode 100644 tools/testing/selftests/vfs/at_flags.h
 create mode 100644 tools/testing/selftests/vfs/common.sh
 create mode 100755 tools/testing/selftests/vfs/tests/0001_at_beneath.sh
 create mode 100755 tools/testing/selftests/vfs/tests/0002_at_xdev.sh
 create mode 100755 tools/testing/selftests/vfs/tests/0003_at_no_proclinks.sh
 create mode 100755 tools/testing/selftests/vfs/tests/0004_at_no_symlinks.sh
 create mode 100755 tools/testing/selftests/vfs/tests/0005_at_this_root.sh
 create mode 100644 tools/testing/selftests/vfs/vfs_helper.c

Comments

Andy Lutomirski Sept. 29, 2018, 2:25 p.m. UTC | #1
> On Sep 29, 2018, at 3:34 AM, Aleksa Sarai <cyphar@cyphar.com> wrote:
> 
> The need for some sort of control over VFS's path resolution (to avoid
> malicious paths resulting in inadvertent breakouts) has been a very
> long-standing desire of many userspace applications. This patchset is a
> revival of Al Viro's old AT_NO_JUMPS[1] patchset with a few additions.
> 
> The most obvious change is that AT_NO_JUMPS has been split as dicussed
> in the original thread, along with a further split of AT_NO_PROCLINKS
> which means that each individual property of AT_NO_JUMPS is now a
> separate flag:
> 
>  * Path-based escapes from the starting-point using "/" or ".." are
>    blocked by AT_BENEATH.

Seems useful.

>  * Mountpoint crossings are blocked by AT_XDEV.

Seems useful.

>  * /proc/$pid/fd/$fd resolution is blocked by AT_NO_PROCLINKS (more
>    correctly it actually blocks any user of nd_jump_link() because it
>    allows out-of-VFS path resolution manipulation).
> 

So how do I disable following symlinks? ISTM the most natural way would be to have AT_NO_SYMLINKS, and to have that flag disable proc links.
Christian Brauner Sept. 29, 2018, 2:38 p.m. UTC | #2
On Sat, Sep 29, 2018 at 08:34:50PM +1000, Aleksa Sarai wrote:
> The need for some sort of control over VFS's path resolution (to avoid
> malicious paths resulting in inadvertent breakouts) has been a very
> long-standing desire of many userspace applications. This patchset is a
> revival of Al Viro's old AT_NO_JUMPS[1] patchset with a few additions.
> 
> The most obvious change is that AT_NO_JUMPS has been split as dicussed
> in the original thread, along with a further split of AT_NO_PROCLINKS
> which means that each individual property of AT_NO_JUMPS is now a
> separate flag:
> 
>   * Path-based escapes from the starting-point using "/" or ".." are
>     blocked by AT_BENEATH.
>   * Mountpoint crossings are blocked by AT_XDEV.
>   * /proc/$pid/fd/$fd resolution is blocked by AT_NO_PROCLINKS (more
> 	correctly it actually blocks any user of nd_jump_link() because it
> 	allows out-of-VFS path resolution manipulation).
> 
> AT_NO_JUMPS is now effectively (AT_BENEATH|AT_XDEV|AT_NO_PROCLINKS). At
> Linus' suggestion in the original thread, I've also implemented
> AT_NO_SYMLINKS which just denies _all_ symlink resolution (including
> "proclink" resolution).
> 
> An additional improvement was made to AT_XDEV. The original AT_NO_JUMPS
> path didn't consider "/tmp/.." as a mountpoint crossing -- this patch
> blocks this as well (feel free to ask me to remove it if you feel this
> is not sane).

Imho, these flags are very much needed and they all are pretty useful
not just for container runtimes but in general.

> 
> Currently I've only enabled these for openat(2) and the stat(2) family.
> I would hope we could enable it for basically every *at(2) syscall --
> but many of them appear to not have a @flags argument and thus we'll
> need to add several new syscalls to do this. I'm more than happy to send
> those patches, but I'd prefer to know that this preliminary work is
> acceptable before doing a bunch of copy-paste to add new sets of *at(2)
> syscalls.

We should really make sure that we can't make due with openat() alone
before adding a bunch of new syscalls. So there's no need to rush into
this. :)

> 
> One additional feature I've implemented is AT_THIS_ROOT (I imagine this
> is probably going to be more contentious than the refresh of
> AT_NO_JUMPS, so I've included it in a separate patch). The patch itself
> describes my reasoning, but the shortened version of the premise is that
> continer runtimes need to have a way to resolve paths within a
> potentially malicious rootfs. Container runtimes currently do this in
> userspace[2] which has implicit race conditions that are not resolvable
> in userspace (or use fork+exec+chroot and SCM_RIGHTS passing which is
> inefficient). AT_THIS_ROOT allows for per-call chroot-like semantics for
> path resolution, which would be invaluable for us -- and the
> implementation is basically identical to AT_BENEATH (except that we
> don't return errors when someone actually hits the root).
> 
> I've added some selftests for this, but it's not clear to me whether
> they should live here or in xfstests (as far as I can tell there are no
> other VFS tests in selftests, while there are some tests that look like
> generic VFS tests in xfstests). If you'd prefer them to be included in
> xfstests, let me know.
> 
> [1]: https://lore.kernel.org/patchwork/patch/784221/
> [2]: https://github.com/cyphar/filepath-securejoin
> 
> Aleksa Sarai (3):
>   namei: implement O_BENEATH-style AT_* flags
>   namei: implement AT_THIS_ROOT chroot-like path resolution
>   selftests: vfs: add AT_* path resolution tests
> 
>  fs/fcntl.c                                    |   2 +-
>  fs/namei.c                                    | 158 ++++++++++++------
>  fs/open.c                                     |  10 ++
>  fs/stat.c                                     |  15 +-
>  include/linux/fcntl.h                         |   3 +-
>  include/linux/namei.h                         |   8 +
>  include/uapi/asm-generic/fcntl.h              |  20 +++
>  include/uapi/linux/fcntl.h                    |  10 ++
>  tools/testing/selftests/Makefile              |   1 +
>  tools/testing/selftests/vfs/.gitignore        |   1 +
>  tools/testing/selftests/vfs/Makefile          |  13 ++
>  tools/testing/selftests/vfs/at_flags.h        |  40 +++++
>  tools/testing/selftests/vfs/common.sh         |  37 ++++
>  .../selftests/vfs/tests/0001_at_beneath.sh    |  72 ++++++++
>  .../selftests/vfs/tests/0002_at_xdev.sh       |  54 ++++++
>  .../vfs/tests/0003_at_no_proclinks.sh         |  50 ++++++
>  .../vfs/tests/0004_at_no_symlinks.sh          |  49 ++++++
>  .../selftests/vfs/tests/0005_at_this_root.sh  |  66 ++++++++
>  tools/testing/selftests/vfs/vfs_helper.c      | 154 +++++++++++++++++
>  19 files changed, 707 insertions(+), 56 deletions(-)
>  create mode 100644 tools/testing/selftests/vfs/.gitignore
>  create mode 100644 tools/testing/selftests/vfs/Makefile
>  create mode 100644 tools/testing/selftests/vfs/at_flags.h
>  create mode 100644 tools/testing/selftests/vfs/common.sh
>  create mode 100755 tools/testing/selftests/vfs/tests/0001_at_beneath.sh
>  create mode 100755 tools/testing/selftests/vfs/tests/0002_at_xdev.sh
>  create mode 100755 tools/testing/selftests/vfs/tests/0003_at_no_proclinks.sh
>  create mode 100755 tools/testing/selftests/vfs/tests/0004_at_no_symlinks.sh
>  create mode 100755 tools/testing/selftests/vfs/tests/0005_at_this_root.sh
>  create mode 100644 tools/testing/selftests/vfs/vfs_helper.c
> 
> -- 
> 2.19.0
>
Aleksa Sarai Sept. 29, 2018, 3:45 p.m. UTC | #3
On 2018-09-29, Andy Lutomirski <luto@amacapital.net> wrote:
> > The most obvious change is that AT_NO_JUMPS has been split as dicussed
> > in the original thread, along with a further split of AT_NO_PROCLINKS
> > which means that each individual property of AT_NO_JUMPS is now a
> > separate flag:
> > 
> >  * Path-based escapes from the starting-point using "/" or ".." are
> >    blocked by AT_BENEATH.
> 
> Seems useful.
> 
> >  * Mountpoint crossings are blocked by AT_XDEV.
> 
> Seems useful.
> 
> >  * /proc/$pid/fd/$fd resolution is blocked by AT_NO_PROCLINKS (more
> >    correctly it actually blocks any user of nd_jump_link() because it
> >    allows out-of-VFS path resolution manipulation).
> > 
> 
> So how do I disable following symlinks? ISTM the most natural way
> would be to have AT_NO_SYMLINKS, and to have that flag disable proc
> links.

So, this patchset has both AT_NO_SYMLINKS and AT_NO_PROCLINKS.

* AT_NO_SYMLINKS blocks *all* symlinks (which is something Linus requested
  in the original thread[2] -- apparently this is something that would
  be useful to git even if wouldn't violate AT_BENEATH). This implies
  AT_NO_PROCLINKS.

* AT_NO_PROCLINKS only blocks procfs-style "symlinks" (filesystem
  "symlinks" that call nd_jump_link() themselves -- currently only
  procfs and nsfs).

The reason why we need AT_NO_PROCLINKS is that "proclinks"[*] allow for
breaking-out of nd->root without a trivial way of detecting it (since
the filesystem can manipulate nd->path almost arbitrarily outside of the
control of VFS). Al Viro's original patchset[1] also blocked these but
it was all included within AT_NO_JUMPS.

Requiring you to block *all* symlinks in order to block "proclinks"
seems to be a bit overkill to me (especially if consider that
AT_THIS_ROOT|AT_NO_PROCLINKS is definitely a usecase most container
runtimes would be _very_ interested in -- while AT_NO_SYMLINKS will
cause issues with most distribution images).

[*]: Sorry for the awful naming, I'm not sure what the correct name is
	 (I've called them "super symlinks" in the past) -- if you have a
	 better name please let me know!

[1]: https://lwn.net/Articles/721443/
[2]: https://marc.info/?l=linux-kernel&m=149394765324531&w=2
Andy Lutomirski Sept. 29, 2018, 4:34 p.m. UTC | #4
> On Sep 29, 2018, at 8:45 AM, Aleksa Sarai <cyphar@cyphar.com> wrote:
> 
> On 2018-09-29, Andy Lutomirski <luto@amacapital.net> wrote:
>>> The most obvious change is that AT_NO_JUMPS has been split as dicussed
>>> in the original thread, along with a further split of AT_NO_PROCLINKS
>>> which means that each individual property of AT_NO_JUMPS is now a
>>> separate flag:
>>> 
>>> * Path-based escapes from the starting-point using "/" or ".." are
>>>   blocked by AT_BENEATH.
>> 
>> Seems useful.
>> 
>>> * Mountpoint crossings are blocked by AT_XDEV.
>> 
>> Seems useful.
>> 
>>> * /proc/$pid/fd/$fd resolution is blocked by AT_NO_PROCLINKS (more
>>>   correctly it actually blocks any user of nd_jump_link() because it
>>>   allows out-of-VFS path resolution manipulation).
>>> 
>> 
>> So how do I disable following symlinks? ISTM the most natural way
>> would be to have AT_NO_SYMLINKS, and to have that flag disable proc
>> links.
> 
> So, this patchset has both AT_NO_SYMLINKS and AT_NO_PROCLINKS.

And AT_THIS_ROOT, which is neat. Want to update your cover letter to include all of this?  Or at I just reading the wrong thing?

> 
> * AT_NO_SYMLINKS blocks *all* symlinks (which is something Linus requested
>  in the original thread[2] -- apparently this is something that would
>  be useful to git even if wouldn't violate AT_BENEATH). This implies
>  AT_NO_PROCLINKS.
> 
> * AT_NO_PROCLINKS only blocks procfs-style "symlinks" (filesystem
>  "symlinks" that call nd_jump_link() themselves -- currently only
>  procfs and nsfs).
> 

Hmm. I’m not sure that blocking nsfs links is always what the container runtime wants, but the overall concept sounds quite useful.  Maybe call it AT_NO_TELEPORT?  Or AT_NO_MAGIC_LINKS?

Also, as a perhaps-silly suggestion: if you end up adding a new syscall, I can see a use for a mode that does the path walk but, rather than failing on a disallowed link, stops early and indicates where it stopped. Then web servers, samba, etc can more efficiently implement custom behavior when links are encountered.  And it may also be useful to have a variant of AT_THIS_ROOT where trying to escape is an error instead of having it just get stuck at the root.
Matthew Wilcox Sept. 29, 2018, 7:44 p.m. UTC | #5
On Sat, Sep 29, 2018 at 09:34:24AM -0700, Andy Lutomirski wrote:
> Also, as a perhaps-silly suggestion: if you end up adding a new
> syscall, I can see a use for a mode that does the path walk but, rather
> than failing on a disallowed link, stops early and indicates where it
> stopped. Then web servers, samba, etc can more efficiently implement
> custom behavior when links are encountered.  And it may also be useful
> to have a variant of AT_THIS_ROOT where trying to escape is an error
> instead of having it just get stuck at the root.

AT_USER_LINKS indicating that userspace wants to resolve symlinks
themselves?
Aleksa Sarai Sept. 30, 2018, 4:44 a.m. UTC | #6
On 2018-09-29, Christian Brauner <christian@brauner.io> wrote:
> > Currently I've only enabled these for openat(2) and the stat(2) family.
> > I would hope we could enable it for basically every *at(2) syscall --
> > but many of them appear to not have a @flags argument and thus we'll
> > need to add several new syscalls to do this. I'm more than happy to send
> > those patches, but I'd prefer to know that this preliminary work is
> > acceptable before doing a bunch of copy-paste to add new sets of *at(2)
> > syscalls.
> 
> We should really make sure that we can't make due with openat() alone
> before adding a bunch of new syscalls. So there's no need to rush into
> this. :)

Yeah, I think that we could (mostly) make do with openat(2). We might
need to have renameat(2) and a few others, but if we had more support
for AT_EMPTY_PATH you should be able to just O_PATH|O_{BENEATH,XDEV,...}
and then operate on the O_PATH fd.
Alban Crequy Sept. 30, 2018, 1:54 p.m. UTC | #7
On Sat, Sep 29, 2018 at 12:35 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
>
> The need for some sort of control over VFS's path resolution (to avoid
> malicious paths resulting in inadvertent breakouts) has been a very
> long-standing desire of many userspace applications. This patchset is a
> revival of Al Viro's old AT_NO_JUMPS[1] patchset with a few additions.
>
> The most obvious change is that AT_NO_JUMPS has been split as dicussed
> in the original thread, along with a further split of AT_NO_PROCLINKS
> which means that each individual property of AT_NO_JUMPS is now a
> separate flag:
>
>   * Path-based escapes from the starting-point using "/" or ".." are
>     blocked by AT_BENEATH.
>   * Mountpoint crossings are blocked by AT_XDEV.
>   * /proc/$pid/fd/$fd resolution is blocked by AT_NO_PROCLINKS (more
>         correctly it actually blocks any user of nd_jump_link() because it
>         allows out-of-VFS path resolution manipulation).
>
> AT_NO_JUMPS is now effectively (AT_BENEATH|AT_XDEV|AT_NO_PROCLINKS). At
> Linus' suggestion in the original thread, I've also implemented
> AT_NO_SYMLINKS which just denies _all_ symlink resolution (including
> "proclink" resolution).

It seems quite useful to me.

> An additional improvement was made to AT_XDEV. The original AT_NO_JUMPS
> path didn't consider "/tmp/.." as a mountpoint crossing -- this patch
> blocks this as well (feel free to ask me to remove it if you feel this
> is not sane).
>
> Currently I've only enabled these for openat(2) and the stat(2) family.
> I would hope we could enable it for basically every *at(2) syscall --
> but many of them appear to not have a @flags argument and thus we'll
> need to add several new syscalls to do this. I'm more than happy to send
> those patches, but I'd prefer to know that this preliminary work is
> acceptable before doing a bunch of copy-paste to add new sets of *at(2)
> syscalls.

What do you think of an equivalent feature AT_NO_SYMLINKS flag for mount()?

I guess that would have made the fix for CVE-2017-1002101 in
Kubernetes easier to write:
https://kubernetes.io/blog/2018/04/04/fixing-subpath-volume-vulnerability/

> One additional feature I've implemented is AT_THIS_ROOT (I imagine this
> is probably going to be more contentious than the refresh of
> AT_NO_JUMPS, so I've included it in a separate patch). The patch itself
> describes my reasoning, but the shortened version of the premise is that
> continer runtimes need to have a way to resolve paths within a
> potentially malicious rootfs. Container runtimes currently do this in
> userspace[2] which has implicit race conditions that are not resolvable
> in userspace (or use fork+exec+chroot and SCM_RIGHTS passing which is
> inefficient). AT_THIS_ROOT allows for per-call chroot-like semantics for
> path resolution, which would be invaluable for us -- and the
> implementation is basically identical to AT_BENEATH (except that we
> don't return errors when someone actually hits the root).
>
> I've added some selftests for this, but it's not clear to me whether
> they should live here or in xfstests (as far as I can tell there are no
> other VFS tests in selftests, while there are some tests that look like
> generic VFS tests in xfstests). If you'd prefer them to be included in
> xfstests, let me know.
>
> [1]: https://lore.kernel.org/patchwork/patch/784221/
> [2]: https://github.com/cyphar/filepath-securejoin
>
> Aleksa Sarai (3):
>   namei: implement O_BENEATH-style AT_* flags
>   namei: implement AT_THIS_ROOT chroot-like path resolution
>   selftests: vfs: add AT_* path resolution tests
>
>  fs/fcntl.c                                    |   2 +-
>  fs/namei.c                                    | 158 ++++++++++++------
>  fs/open.c                                     |  10 ++
>  fs/stat.c                                     |  15 +-
>  include/linux/fcntl.h                         |   3 +-
>  include/linux/namei.h                         |   8 +
>  include/uapi/asm-generic/fcntl.h              |  20 +++
>  include/uapi/linux/fcntl.h                    |  10 ++
>  tools/testing/selftests/Makefile              |   1 +
>  tools/testing/selftests/vfs/.gitignore        |   1 +
>  tools/testing/selftests/vfs/Makefile          |  13 ++
>  tools/testing/selftests/vfs/at_flags.h        |  40 +++++
>  tools/testing/selftests/vfs/common.sh         |  37 ++++
>  .../selftests/vfs/tests/0001_at_beneath.sh    |  72 ++++++++
>  .../selftests/vfs/tests/0002_at_xdev.sh       |  54 ++++++
>  .../vfs/tests/0003_at_no_proclinks.sh         |  50 ++++++
>  .../vfs/tests/0004_at_no_symlinks.sh          |  49 ++++++
>  .../selftests/vfs/tests/0005_at_this_root.sh  |  66 ++++++++
>  tools/testing/selftests/vfs/vfs_helper.c      | 154 +++++++++++++++++
>  19 files changed, 707 insertions(+), 56 deletions(-)
>  create mode 100644 tools/testing/selftests/vfs/.gitignore
>  create mode 100644 tools/testing/selftests/vfs/Makefile
>  create mode 100644 tools/testing/selftests/vfs/at_flags.h
>  create mode 100644 tools/testing/selftests/vfs/common.sh
>  create mode 100755 tools/testing/selftests/vfs/tests/0001_at_beneath.sh
>  create mode 100755 tools/testing/selftests/vfs/tests/0002_at_xdev.sh
>  create mode 100755 tools/testing/selftests/vfs/tests/0003_at_no_proclinks.sh
>  create mode 100755 tools/testing/selftests/vfs/tests/0004_at_no_symlinks.sh
>  create mode 100755 tools/testing/selftests/vfs/tests/0005_at_this_root.sh
>  create mode 100644 tools/testing/selftests/vfs/vfs_helper.c
>
> --
> 2.19.0
Christian Brauner Sept. 30, 2018, 2:02 p.m. UTC | #8
On September 30, 2018 3:54:31 PM GMT+02:00, Alban Crequy <alban@kinvolk.io> wrote:
>On Sat, Sep 29, 2018 at 12:35 PM Aleksa Sarai <cyphar@cyphar.com>
>wrote:
>>
>> The need for some sort of control over VFS's path resolution (to
>avoid
>> malicious paths resulting in inadvertent breakouts) has been a very
>> long-standing desire of many userspace applications. This patchset is
>a
>> revival of Al Viro's old AT_NO_JUMPS[1] patchset with a few
>additions.
>>
>> The most obvious change is that AT_NO_JUMPS has been split as
>dicussed
>> in the original thread, along with a further split of AT_NO_PROCLINKS
>> which means that each individual property of AT_NO_JUMPS is now a
>> separate flag:
>>
>>   * Path-based escapes from the starting-point using "/" or ".." are
>>     blocked by AT_BENEATH.
>>   * Mountpoint crossings are blocked by AT_XDEV.
>>   * /proc/$pid/fd/$fd resolution is blocked by AT_NO_PROCLINKS (more
>>         correctly it actually blocks any user of nd_jump_link()
>because it
>>         allows out-of-VFS path resolution manipulation).
>>
>> AT_NO_JUMPS is now effectively (AT_BENEATH|AT_XDEV|AT_NO_PROCLINKS).
>At
>> Linus' suggestion in the original thread, I've also implemented
>> AT_NO_SYMLINKS which just denies _all_ symlink resolution (including
>> "proclink" resolution).
>
>It seems quite useful to me.
>
>> An additional improvement was made to AT_XDEV. The original
>AT_NO_JUMPS
>> path didn't consider "/tmp/.." as a mountpoint crossing -- this patch
>> blocks this as well (feel free to ask me to remove it if you feel
>this
>> is not sane).
>>
>> Currently I've only enabled these for openat(2) and the stat(2)
>family.
>> I would hope we could enable it for basically every *at(2) syscall --
>> but many of them appear to not have a @flags argument and thus we'll
>> need to add several new syscalls to do this. I'm more than happy to
>send
>> those patches, but I'd prefer to know that this preliminary work is
>> acceptable before doing a bunch of copy-paste to add new sets of
>*at(2)
>> syscalls.
>
>What do you think of an equivalent feature AT_NO_SYMLINKS flag for
>mount()?

That's something we discussed  but that would need to be part of the new mount API work by David. The current mount API doesn't take AT_* flags since it doesn't operate on fds and we're (sort of) out of mount flags.

>
>I guess that would have made the fix for CVE-2017-1002101 in
>Kubernetes easier to write:
>https://kubernetes.io/blog/2018/04/04/fixing-subpath-volume-vulnerability/
>
>> One additional feature I've implemented is AT_THIS_ROOT (I imagine
>this
>> is probably going to be more contentious than the refresh of
>> AT_NO_JUMPS, so I've included it in a separate patch). The patch
>itself
>> describes my reasoning, but the shortened version of the premise is
>that
>> continer runtimes need to have a way to resolve paths within a
>> potentially malicious rootfs. Container runtimes currently do this in
>> userspace[2] which has implicit race conditions that are not
>resolvable
>> in userspace (or use fork+exec+chroot and SCM_RIGHTS passing which is
>> inefficient). AT_THIS_ROOT allows for per-call chroot-like semantics
>for
>> path resolution, which would be invaluable for us -- and the
>> implementation is basically identical to AT_BENEATH (except that we
>> don't return errors when someone actually hits the root).
>>
>> I've added some selftests for this, but it's not clear to me whether
>> they should live here or in xfstests (as far as I can tell there are
>no
>> other VFS tests in selftests, while there are some tests that look
>like
>> generic VFS tests in xfstests). If you'd prefer them to be included
>in
>> xfstests, let me know.
>>
>> [1]: https://lore.kernel.org/patchwork/patch/784221/
>> [2]: https://github.com/cyphar/filepath-securejoin
>>
>> Aleksa Sarai (3):
>>   namei: implement O_BENEATH-style AT_* flags
>>   namei: implement AT_THIS_ROOT chroot-like path resolution
>>   selftests: vfs: add AT_* path resolution tests
>>
>>  fs/fcntl.c                                    |   2 +-
>>  fs/namei.c                                    | 158
>++++++++++++------
>>  fs/open.c                                     |  10 ++
>>  fs/stat.c                                     |  15 +-
>>  include/linux/fcntl.h                         |   3 +-
>>  include/linux/namei.h                         |   8 +
>>  include/uapi/asm-generic/fcntl.h              |  20 +++
>>  include/uapi/linux/fcntl.h                    |  10 ++
>>  tools/testing/selftests/Makefile              |   1 +
>>  tools/testing/selftests/vfs/.gitignore        |   1 +
>>  tools/testing/selftests/vfs/Makefile          |  13 ++
>>  tools/testing/selftests/vfs/at_flags.h        |  40 +++++
>>  tools/testing/selftests/vfs/common.sh         |  37 ++++
>>  .../selftests/vfs/tests/0001_at_beneath.sh    |  72 ++++++++
>>  .../selftests/vfs/tests/0002_at_xdev.sh       |  54 ++++++
>>  .../vfs/tests/0003_at_no_proclinks.sh         |  50 ++++++
>>  .../vfs/tests/0004_at_no_symlinks.sh          |  49 ++++++
>>  .../selftests/vfs/tests/0005_at_this_root.sh  |  66 ++++++++
>>  tools/testing/selftests/vfs/vfs_helper.c      | 154
>+++++++++++++++++
>>  19 files changed, 707 insertions(+), 56 deletions(-)
>>  create mode 100644 tools/testing/selftests/vfs/.gitignore
>>  create mode 100644 tools/testing/selftests/vfs/Makefile
>>  create mode 100644 tools/testing/selftests/vfs/at_flags.h
>>  create mode 100644 tools/testing/selftests/vfs/common.sh
>>  create mode 100755
>tools/testing/selftests/vfs/tests/0001_at_beneath.sh
>>  create mode 100755 tools/testing/selftests/vfs/tests/0002_at_xdev.sh
>>  create mode 100755
>tools/testing/selftests/vfs/tests/0003_at_no_proclinks.sh
>>  create mode 100755
>tools/testing/selftests/vfs/tests/0004_at_no_symlinks.sh
>>  create mode 100755
>tools/testing/selftests/vfs/tests/0005_at_this_root.sh
>>  create mode 100644 tools/testing/selftests/vfs/vfs_helper.c
>>
>> --
>> 2.19.0
Mickaël Salaün Sept. 30, 2018, 7:45 p.m. UTC | #9
As a side note, I'm still working on Landlock which can achieve the same
goal but in a more flexible and dynamic way: https://landlock.io

Regards,
 Mickaël

On 9/29/18 12:34, Aleksa Sarai wrote:
> The need for some sort of control over VFS's path resolution (to avoid
> malicious paths resulting in inadvertent breakouts) has been a very
> long-standing desire of many userspace applications. This patchset is a
> revival of Al Viro's old AT_NO_JUMPS[1] patchset with a few additions.
> 
> The most obvious change is that AT_NO_JUMPS has been split as dicussed
> in the original thread, along with a further split of AT_NO_PROCLINKS
> which means that each individual property of AT_NO_JUMPS is now a
> separate flag:
> 
>   * Path-based escapes from the starting-point using "/" or ".." are
>     blocked by AT_BENEATH.
>   * Mountpoint crossings are blocked by AT_XDEV.
>   * /proc/$pid/fd/$fd resolution is blocked by AT_NO_PROCLINKS (more
> 	correctly it actually blocks any user of nd_jump_link() because it
> 	allows out-of-VFS path resolution manipulation).
> 
> AT_NO_JUMPS is now effectively (AT_BENEATH|AT_XDEV|AT_NO_PROCLINKS). At
> Linus' suggestion in the original thread, I've also implemented
> AT_NO_SYMLINKS which just denies _all_ symlink resolution (including
> "proclink" resolution).
> 
> An additional improvement was made to AT_XDEV. The original AT_NO_JUMPS
> path didn't consider "/tmp/.." as a mountpoint crossing -- this patch
> blocks this as well (feel free to ask me to remove it if you feel this
> is not sane).
> 
> Currently I've only enabled these for openat(2) and the stat(2) family.
> I would hope we could enable it for basically every *at(2) syscall --
> but many of them appear to not have a @flags argument and thus we'll
> need to add several new syscalls to do this. I'm more than happy to send
> those patches, but I'd prefer to know that this preliminary work is
> acceptable before doing a bunch of copy-paste to add new sets of *at(2)
> syscalls.
> 
> One additional feature I've implemented is AT_THIS_ROOT (I imagine this
> is probably going to be more contentious than the refresh of
> AT_NO_JUMPS, so I've included it in a separate patch). The patch itself
> describes my reasoning, but the shortened version of the premise is that
> continer runtimes need to have a way to resolve paths within a
> potentially malicious rootfs. Container runtimes currently do this in
> userspace[2] which has implicit race conditions that are not resolvable
> in userspace (or use fork+exec+chroot and SCM_RIGHTS passing which is
> inefficient). AT_THIS_ROOT allows for per-call chroot-like semantics for
> path resolution, which would be invaluable for us -- and the
> implementation is basically identical to AT_BENEATH (except that we
> don't return errors when someone actually hits the root).
> 
> I've added some selftests for this, but it's not clear to me whether
> they should live here or in xfstests (as far as I can tell there are no
> other VFS tests in selftests, while there are some tests that look like
> generic VFS tests in xfstests). If you'd prefer them to be included in
> xfstests, let me know.
> 
> [1]: https://lore.kernel.org/patchwork/patch/784221/
> [2]: https://github.com/cyphar/filepath-securejoin
> 
> Aleksa Sarai (3):
>   namei: implement O_BENEATH-style AT_* flags
>   namei: implement AT_THIS_ROOT chroot-like path resolution
>   selftests: vfs: add AT_* path resolution tests
> 
>  fs/fcntl.c                                    |   2 +-
>  fs/namei.c                                    | 158 ++++++++++++------
>  fs/open.c                                     |  10 ++
>  fs/stat.c                                     |  15 +-
>  include/linux/fcntl.h                         |   3 +-
>  include/linux/namei.h                         |   8 +
>  include/uapi/asm-generic/fcntl.h              |  20 +++
>  include/uapi/linux/fcntl.h                    |  10 ++
>  tools/testing/selftests/Makefile              |   1 +
>  tools/testing/selftests/vfs/.gitignore        |   1 +
>  tools/testing/selftests/vfs/Makefile          |  13 ++
>  tools/testing/selftests/vfs/at_flags.h        |  40 +++++
>  tools/testing/selftests/vfs/common.sh         |  37 ++++
>  .../selftests/vfs/tests/0001_at_beneath.sh    |  72 ++++++++
>  .../selftests/vfs/tests/0002_at_xdev.sh       |  54 ++++++
>  .../vfs/tests/0003_at_no_proclinks.sh         |  50 ++++++
>  .../vfs/tests/0004_at_no_symlinks.sh          |  49 ++++++
>  .../selftests/vfs/tests/0005_at_this_root.sh  |  66 ++++++++
>  tools/testing/selftests/vfs/vfs_helper.c      | 154 +++++++++++++++++
>  19 files changed, 707 insertions(+), 56 deletions(-)
>  create mode 100644 tools/testing/selftests/vfs/.gitignore
>  create mode 100644 tools/testing/selftests/vfs/Makefile
>  create mode 100644 tools/testing/selftests/vfs/at_flags.h
>  create mode 100644 tools/testing/selftests/vfs/common.sh
>  create mode 100755 tools/testing/selftests/vfs/tests/0001_at_beneath.sh
>  create mode 100755 tools/testing/selftests/vfs/tests/0002_at_xdev.sh
>  create mode 100755 tools/testing/selftests/vfs/tests/0003_at_no_proclinks.sh
>  create mode 100755 tools/testing/selftests/vfs/tests/0004_at_no_symlinks.sh
>  create mode 100755 tools/testing/selftests/vfs/tests/0005_at_this_root.sh
>  create mode 100644 tools/testing/selftests/vfs/vfs_helper.c
>
Jann Horn Sept. 30, 2018, 9:46 p.m. UTC | #10
On Sun, Sep 30, 2018 at 10:39 PM Mickaël Salaün <mic@digikod.net> wrote:
> As a side note, I'm still working on Landlock which can achieve the same
> goal but in a more flexible and dynamic way: https://landlock.io

Isn't Landlock mostly intended for userspace that wants to impose a
custom Mandatory Access Control policy on itself, restricting the
whole process?

As far as I can tell, a major usecase for AT_BENEATH are privileged
processes that do not want to restrict all filesystem operations they
perform, but want to sometimes impose limits on filesystem traversal
for the duration of a single system call. For example, a process might
want to first open a file from an untrusted filesystem area with
AT_BENEATH, and afterwards open a configuration file without
AT_BENEATH.

How would you do this in Landlock? Use a BPF map to store per-thread
filesystem restrictions, and then do bpf() calls before and after
every restricted filesystem access to set and unset the policy for the
current syscall?

> On 9/29/18 12:34, Aleksa Sarai wrote:
> > The need for some sort of control over VFS's path resolution (to avoid
> > malicious paths resulting in inadvertent breakouts) has been a very
> > long-standing desire of many userspace applications. This patchset is a
> > revival of Al Viro's old AT_NO_JUMPS[1] patchset with a few additions.
> >
> > The most obvious change is that AT_NO_JUMPS has been split as dicussed
> > in the original thread, along with a further split of AT_NO_PROCLINKS
> > which means that each individual property of AT_NO_JUMPS is now a
> > separate flag:
> >
> >   * Path-based escapes from the starting-point using "/" or ".." are
> >     blocked by AT_BENEATH.
> >   * Mountpoint crossings are blocked by AT_XDEV.
> >   * /proc/$pid/fd/$fd resolution is blocked by AT_NO_PROCLINKS (more
> >       correctly it actually blocks any user of nd_jump_link() because it
> >       allows out-of-VFS path resolution manipulation).
> >
> > AT_NO_JUMPS is now effectively (AT_BENEATH|AT_XDEV|AT_NO_PROCLINKS). At
> > Linus' suggestion in the original thread, I've also implemented
> > AT_NO_SYMLINKS which just denies _all_ symlink resolution (including
> > "proclink" resolution).
> >
> > An additional improvement was made to AT_XDEV. The original AT_NO_JUMPS
> > path didn't consider "/tmp/.." as a mountpoint crossing -- this patch
> > blocks this as well (feel free to ask me to remove it if you feel this
> > is not sane).
> >
> > Currently I've only enabled these for openat(2) and the stat(2) family.
> > I would hope we could enable it for basically every *at(2) syscall --
> > but many of them appear to not have a @flags argument and thus we'll
> > need to add several new syscalls to do this. I'm more than happy to send
> > those patches, but I'd prefer to know that this preliminary work is
> > acceptable before doing a bunch of copy-paste to add new sets of *at(2)
> > syscalls.
> >
> > One additional feature I've implemented is AT_THIS_ROOT (I imagine this
> > is probably going to be more contentious than the refresh of
> > AT_NO_JUMPS, so I've included it in a separate patch). The patch itself
> > describes my reasoning, but the shortened version of the premise is that
> > continer runtimes need to have a way to resolve paths within a
> > potentially malicious rootfs. Container runtimes currently do this in
> > userspace[2] which has implicit race conditions that are not resolvable
> > in userspace (or use fork+exec+chroot and SCM_RIGHTS passing which is
> > inefficient). AT_THIS_ROOT allows for per-call chroot-like semantics for
> > path resolution, which would be invaluable for us -- and the
> > implementation is basically identical to AT_BENEATH (except that we
> > don't return errors when someone actually hits the root).
> >
> > I've added some selftests for this, but it's not clear to me whether
> > they should live here or in xfstests (as far as I can tell there are no
> > other VFS tests in selftests, while there are some tests that look like
> > generic VFS tests in xfstests). If you'd prefer them to be included in
> > xfstests, let me know.
> >
> > [1]: https://lore.kernel.org/patchwork/patch/784221/
> > [2]: https://github.com/cyphar/filepath-securejoin
> >
> > Aleksa Sarai (3):
> >   namei: implement O_BENEATH-style AT_* flags
> >   namei: implement AT_THIS_ROOT chroot-like path resolution
> >   selftests: vfs: add AT_* path resolution tests
> >
> >  fs/fcntl.c                                    |   2 +-
> >  fs/namei.c                                    | 158 ++++++++++++------
> >  fs/open.c                                     |  10 ++
> >  fs/stat.c                                     |  15 +-
> >  include/linux/fcntl.h                         |   3 +-
> >  include/linux/namei.h                         |   8 +
> >  include/uapi/asm-generic/fcntl.h              |  20 +++
> >  include/uapi/linux/fcntl.h                    |  10 ++
> >  tools/testing/selftests/Makefile              |   1 +
> >  tools/testing/selftests/vfs/.gitignore        |   1 +
> >  tools/testing/selftests/vfs/Makefile          |  13 ++
> >  tools/testing/selftests/vfs/at_flags.h        |  40 +++++
> >  tools/testing/selftests/vfs/common.sh         |  37 ++++
> >  .../selftests/vfs/tests/0001_at_beneath.sh    |  72 ++++++++
> >  .../selftests/vfs/tests/0002_at_xdev.sh       |  54 ++++++
> >  .../vfs/tests/0003_at_no_proclinks.sh         |  50 ++++++
> >  .../vfs/tests/0004_at_no_symlinks.sh          |  49 ++++++
> >  .../selftests/vfs/tests/0005_at_this_root.sh  |  66 ++++++++
> >  tools/testing/selftests/vfs/vfs_helper.c      | 154 +++++++++++++++++
> >  19 files changed, 707 insertions(+), 56 deletions(-)
> >  create mode 100644 tools/testing/selftests/vfs/.gitignore
> >  create mode 100644 tools/testing/selftests/vfs/Makefile
> >  create mode 100644 tools/testing/selftests/vfs/at_flags.h
> >  create mode 100644 tools/testing/selftests/vfs/common.sh
> >  create mode 100755 tools/testing/selftests/vfs/tests/0001_at_beneath.sh
> >  create mode 100755 tools/testing/selftests/vfs/tests/0002_at_xdev.sh
> >  create mode 100755 tools/testing/selftests/vfs/tests/0003_at_no_proclinks.sh
> >  create mode 100755 tools/testing/selftests/vfs/tests/0004_at_no_symlinks.sh
> >  create mode 100755 tools/testing/selftests/vfs/tests/0005_at_this_root.sh
> >  create mode 100644 tools/testing/selftests/vfs/vfs_helper.c
> >
>
Mickaël Salaün Sept. 30, 2018, 10:37 p.m. UTC | #11
On 9/30/18 23:46, Jann Horn wrote:
> On Sun, Sep 30, 2018 at 10:39 PM Mickaël Salaün <mic@digikod.net> wrote:
>> As a side note, I'm still working on Landlock which can achieve the same
>> goal but in a more flexible and dynamic way: https://landlock.io
> 
> Isn't Landlock mostly intended for userspace that wants to impose a
> custom Mandatory Access Control policy on itself, restricting the
> whole process?
> 
> As far as I can tell, a major usecase for AT_BENEATH are privileged
> processes that do not want to restrict all filesystem operations they
> perform, but want to sometimes impose limits on filesystem traversal
> for the duration of a single system call. For example, a process might
> want to first open a file from an untrusted filesystem area with
> AT_BENEATH, and afterwards open a configuration file without
> AT_BENEATH.

I didn't realized this was the main use case for AT_BENEATH. Landlock is
indeed dedicated to apply a security policy on a set of processes. This
set can be a process and its children (seccomp-like), or another set of
processes that may be identified with a cgroup.

> 
> How would you do this in Landlock? Use a BPF map to store per-thread
> filesystem restrictions, and then do bpf() calls before and after
> every restricted filesystem access to set and unset the policy for the
> current syscall?

Another way to apply a security policy could be to tied it to a file
descriptor, similarly to Capsicum, which could enable to create
programmable (real) capabilities. This way, it would be possible to
"wrap" a file descriptor with a Landlock program and use it with
FD-based syscalls or pass it to other processes. This would not require
changes to the FS subsystem, but only the Landlock LSM code. This isn't
done yet but I plan to add this new way to restrict operations on file
descriptors.

Anyway, for the use case you mentioned, the AT_BENEATH flag(s) should be
simple to use and enough for now. We must be careful of the hardcoded
policy though.


> 
>> On 9/29/18 12:34, Aleksa Sarai wrote:
>>> The need for some sort of control over VFS's path resolution (to avoid
>>> malicious paths resulting in inadvertent breakouts) has been a very
>>> long-standing desire of many userspace applications. This patchset is a
>>> revival of Al Viro's old AT_NO_JUMPS[1] patchset with a few additions.
>>>
>>> The most obvious change is that AT_NO_JUMPS has been split as dicussed
>>> in the original thread, along with a further split of AT_NO_PROCLINKS
>>> which means that each individual property of AT_NO_JUMPS is now a
>>> separate flag:
>>>
>>>   * Path-based escapes from the starting-point using "/" or ".." are
>>>     blocked by AT_BENEATH.
>>>   * Mountpoint crossings are blocked by AT_XDEV.
>>>   * /proc/$pid/fd/$fd resolution is blocked by AT_NO_PROCLINKS (more
>>>       correctly it actually blocks any user of nd_jump_link() because it
>>>       allows out-of-VFS path resolution manipulation).
>>>
>>> AT_NO_JUMPS is now effectively (AT_BENEATH|AT_XDEV|AT_NO_PROCLINKS). At
>>> Linus' suggestion in the original thread, I've also implemented
>>> AT_NO_SYMLINKS which just denies _all_ symlink resolution (including
>>> "proclink" resolution).
>>>
>>> An additional improvement was made to AT_XDEV. The original AT_NO_JUMPS
>>> path didn't consider "/tmp/.." as a mountpoint crossing -- this patch
>>> blocks this as well (feel free to ask me to remove it if you feel this
>>> is not sane).
>>>
>>> Currently I've only enabled these for openat(2) and the stat(2) family.
>>> I would hope we could enable it for basically every *at(2) syscall --
>>> but many of them appear to not have a @flags argument and thus we'll
>>> need to add several new syscalls to do this. I'm more than happy to send
>>> those patches, but I'd prefer to know that this preliminary work is
>>> acceptable before doing a bunch of copy-paste to add new sets of *at(2)
>>> syscalls.
>>>
>>> One additional feature I've implemented is AT_THIS_ROOT (I imagine this
>>> is probably going to be more contentious than the refresh of
>>> AT_NO_JUMPS, so I've included it in a separate patch). The patch itself
>>> describes my reasoning, but the shortened version of the premise is that
>>> continer runtimes need to have a way to resolve paths within a
>>> potentially malicious rootfs. Container runtimes currently do this in
>>> userspace[2] which has implicit race conditions that are not resolvable
>>> in userspace (or use fork+exec+chroot and SCM_RIGHTS passing which is
>>> inefficient). AT_THIS_ROOT allows for per-call chroot-like semantics for
>>> path resolution, which would be invaluable for us -- and the
>>> implementation is basically identical to AT_BENEATH (except that we
>>> don't return errors when someone actually hits the root).
>>>
>>> I've added some selftests for this, but it's not clear to me whether
>>> they should live here or in xfstests (as far as I can tell there are no
>>> other VFS tests in selftests, while there are some tests that look like
>>> generic VFS tests in xfstests). If you'd prefer them to be included in
>>> xfstests, let me know.
>>>
>>> [1]: https://lore.kernel.org/patchwork/patch/784221/
>>> [2]: https://github.com/cyphar/filepath-securejoin
>>>
>>> Aleksa Sarai (3):
>>>   namei: implement O_BENEATH-style AT_* flags
>>>   namei: implement AT_THIS_ROOT chroot-like path resolution
>>>   selftests: vfs: add AT_* path resolution tests
>>>
>>>  fs/fcntl.c                                    |   2 +-
>>>  fs/namei.c                                    | 158 ++++++++++++------
>>>  fs/open.c                                     |  10 ++
>>>  fs/stat.c                                     |  15 +-
>>>  include/linux/fcntl.h                         |   3 +-
>>>  include/linux/namei.h                         |   8 +
>>>  include/uapi/asm-generic/fcntl.h              |  20 +++
>>>  include/uapi/linux/fcntl.h                    |  10 ++
>>>  tools/testing/selftests/Makefile              |   1 +
>>>  tools/testing/selftests/vfs/.gitignore        |   1 +
>>>  tools/testing/selftests/vfs/Makefile          |  13 ++
>>>  tools/testing/selftests/vfs/at_flags.h        |  40 +++++
>>>  tools/testing/selftests/vfs/common.sh         |  37 ++++
>>>  .../selftests/vfs/tests/0001_at_beneath.sh    |  72 ++++++++
>>>  .../selftests/vfs/tests/0002_at_xdev.sh       |  54 ++++++
>>>  .../vfs/tests/0003_at_no_proclinks.sh         |  50 ++++++
>>>  .../vfs/tests/0004_at_no_symlinks.sh          |  49 ++++++
>>>  .../selftests/vfs/tests/0005_at_this_root.sh  |  66 ++++++++
>>>  tools/testing/selftests/vfs/vfs_helper.c      | 154 +++++++++++++++++
>>>  19 files changed, 707 insertions(+), 56 deletions(-)
>>>  create mode 100644 tools/testing/selftests/vfs/.gitignore
>>>  create mode 100644 tools/testing/selftests/vfs/Makefile
>>>  create mode 100644 tools/testing/selftests/vfs/at_flags.h
>>>  create mode 100644 tools/testing/selftests/vfs/common.sh
>>>  create mode 100755 tools/testing/selftests/vfs/tests/0001_at_beneath.sh
>>>  create mode 100755 tools/testing/selftests/vfs/tests/0002_at_xdev.sh
>>>  create mode 100755 tools/testing/selftests/vfs/tests/0003_at_no_proclinks.sh
>>>  create mode 100755 tools/testing/selftests/vfs/tests/0004_at_no_symlinks.sh
>>>  create mode 100755 tools/testing/selftests/vfs/tests/0005_at_this_root.sh
>>>  create mode 100644 tools/testing/selftests/vfs/vfs_helper.c
>>>
>>
> 
>
Dave Chinner Oct. 1, 2018, 4:08 a.m. UTC | #12
On Sat, Sep 29, 2018 at 08:34:50PM +1000, Aleksa Sarai wrote:
> I've added some selftests for this, but it's not clear to me whether
> they should live here or in xfstests (as far as I can tell there are no
> other VFS tests in selftests, while there are some tests that look like
> generic VFS tests in xfstests). If you'd prefer them to be included in
> xfstests, let me know.

xfstests, please. That way the new functionality will get immediate
coverage by all the main filesystem development and distro QA
teams....

Cheers,

Dave.
Aleksa Sarai Oct. 1, 2018, 5:47 a.m. UTC | #13
On 2018-10-01, Dave Chinner <david@fromorbit.com> wrote:
> > I've added some selftests for this, but it's not clear to me whether
> > they should live here or in xfstests (as far as I can tell there are no
> > other VFS tests in selftests, while there are some tests that look like
> > generic VFS tests in xfstests). If you'd prefer them to be included in
> > xfstests, let me know.
> 
> xfstests, please. That way the new functionality will get immediate
> coverage by all the main filesystem development and distro QA
> teams....

Sure, will do. Do you want me to submit them in parallel -- and what is
the correct ML to submit changes to xfstests? Sorry for the silly
questions. :P
Dave Chinner Oct. 1, 2018, 6:14 a.m. UTC | #14
On Mon, Oct 01, 2018 at 03:47:23PM +1000, Aleksa Sarai wrote:
> On 2018-10-01, Dave Chinner <david@fromorbit.com> wrote:
> > > I've added some selftests for this, but it's not clear to me whether
> > > they should live here or in xfstests (as far as I can tell there are no
> > > other VFS tests in selftests, while there are some tests that look like
> > > generic VFS tests in xfstests). If you'd prefer them to be included in
> > > xfstests, let me know.
> > 
> > xfstests, please. That way the new functionality will get immediate
> > coverage by all the main filesystem development and distro QA
> > teams....
> 
> Sure, will do. Do you want me to submit them in parallel --

That's usually the way we do things, but we don't tend to commit the
fstests changes until the thing it is testing has landed upstream.

> and what is
> the correct ML to submit changes to xfstests?

fstests@vger.kernel.org

> Sorry for the silly questions. :P

You're going to have many more of them when you start moving the
tests across to fstests :P

Cheers,

Dave.
David Laight Oct. 1, 2018, 1:28 p.m. UTC | #15
From: Aleksa Sarai
> Sent: 29 September 2018 11:35
> 
> The need for some sort of control over VFS's path resolution (to avoid
> malicious paths resulting in inadvertent breakouts) has been a very
> long-standing desire of many userspace applications. This patchset is a
> revival of Al Viro's old AT_NO_JUMPS[1] patchset with a few additions.
> 
> The most obvious change is that AT_NO_JUMPS has been split as dicussed
> in the original thread, along with a further split of AT_NO_PROCLINKS
> which means that each individual property of AT_NO_JUMPS is now a
> separate flag:
> 
>   * Path-based escapes from the starting-point using "/" or ".." are
>     blocked by AT_BENEATH.

You may need to allow absolute paths that refer to items inside
the controlled area.
(Even if done by a textual replacement based on the expected name
of the base directory.)

>   * Mountpoint crossings are blocked by AT_XDEV.

You might want a mountpoint flag that allows crossing into the mounted
filesystem (you may need to get out in order to do pwd()).

>   * /proc/$pid/fd/$fd resolution is blocked by AT_NO_PROCLINKS (more
> 	correctly it actually blocks any user of nd_jump_link() because it
> 	allows out-of-VFS path resolution manipulation).

Or 'fix' the /proc/$pid/fd/$fd code to open the actual vnode rather than
being a symlink (although this might still let you get a directory vnode).
FWIW this is what NetBSD does - you can link the open file back into
the filesystem!

> 
> AT_NO_JUMPS is now effectively (AT_BENEATH|AT_XDEV|AT_NO_PROCLINKS). At
> Linus' suggestion in the original thread, I've also implemented
> AT_NO_SYMLINKS which just denies _all_ symlink resolution (including
> "proclink" resolution).

What about allowing 'trivial' symlinks?

...
> Currently I've only enabled these for openat(2) and the stat(2) family.
> I would hope we could enable it for basically every *at(2) syscall --
> but many of them appear to not have a @flags argument and thus we'll
> need to add several new syscalls to do this. I'm more than happy to send
> those patches, but I'd prefer to know that this preliminary work is
> acceptable before doing a bunch of copy-paste to add new sets of *at(2)
> syscalls.

If you make the flags a property of the directory vnode (perhaps as
well as any syscall flags), and make it inherited by vnode lookup then
it can be used to stop library functions (or entire binaries) using
blocked paths.
You'd then only need to add an fcntl() call to set the flags (but never
clear them) to get the restriction applied to every lookup.
...

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Aleksa Sarai Oct. 1, 2018, 4:15 p.m. UTC | #16
On 2018-10-01, David Laight <David.Laight@ACULAB.COM> wrote:
> > The need for some sort of control over VFS's path resolution (to avoid
> > malicious paths resulting in inadvertent breakouts) has been a very
> > long-standing desire of many userspace applications. This patchset is a
> > revival of Al Viro's old AT_NO_JUMPS[1] patchset with a few additions.
> > 
> > The most obvious change is that AT_NO_JUMPS has been split as dicussed
> > in the original thread, along with a further split of AT_NO_PROCLINKS
> > which means that each individual property of AT_NO_JUMPS is now a
> > separate flag:
> > 
> >   * Path-based escapes from the starting-point using "/" or ".." are
> >     blocked by AT_BENEATH.
> 
> You may need to allow absolute paths that refer to items inside
> the controlled area.
> (Even if done by a textual replacement based on the expected name
> of the base directory.)

This is sort of what AT_THIS_ROOT does. I didn't want to include it for
AT_BENEATH because it would be just as contentious as AT_THIS_ROOT
currently is. :P

> >   * Mountpoint crossings are blocked by AT_XDEV.
> 
> You might want a mountpoint flag that allows crossing into the mounted
> filesystem (you may need to get out in order to do pwd()).

Like a mount flag? I'm not sure how I feel about that. The intention is
to allow for a process to have control over how path lookups are
handled, and tying it to a mount flag means that it's no longer entirely
up to the process.

> >   * /proc/$pid/fd/$fd resolution is blocked by AT_NO_PROCLINKS (more
> > 	correctly it actually blocks any user of nd_jump_link() because it
> > 	allows out-of-VFS path resolution manipulation).
> 
> Or 'fix' the /proc/$pid/fd/$fd code to open the actual vnode rather than
> being a symlink (although this might still let you get a directory vnode).
> FWIW this is what NetBSD does - you can link the open file back into
> the filesystem!

Isn't this how it works currently? The /proc/$pid/fd/$fd "symlinks" are
actually references to the underlying file (they can even escape a
pivot_root()) -- you can re-open them or do any number of other dodgy
things through /proc with them (we definitely abuse this in container
runtimes -- and I'm sure plenty of other people do as well).

> > AT_NO_JUMPS is now effectively (AT_BENEATH|AT_XDEV|AT_NO_PROCLINKS). At
> > Linus' suggestion in the original thread, I've also implemented
> > AT_NO_SYMLINKS which just denies _all_ symlink resolution (including
> > "proclink" resolution).
> 
> What about allowing 'trivial' symlinks?

The use-case of AT_NO_SYMLINKS that Linus pitched[1] is that git wants
to have a unique name for every object and so allowing trivial symlinks
is a no-go. I assume "trivial" here means "no-'..' components"?

> > Currently I've only enabled these for openat(2) and the stat(2) family.
> > I would hope we could enable it for basically every *at(2) syscall --
> > but many of them appear to not have a @flags argument and thus we'll
> > need to add several new syscalls to do this. I'm more than happy to send
> > those patches, but I'd prefer to know that this preliminary work is
> > acceptable before doing a bunch of copy-paste to add new sets of *at(2)
> > syscalls.
> 
> If you make the flags a property of the directory vnode (perhaps as
> well as any syscall flags), and make it inherited by vnode lookup then
> it can be used to stop library functions (or entire binaries) using
> blocked paths.
> You'd then only need to add an fcntl() call to set the flags (but never
> clear them) to get the restriction applied to every lookup.

This seems like it might be useful, but it could always be done as a
follow-up patch by just setting LOOKUP_BLAH if the dirfd has the flag
set. I'm also a little bit concerned that (because fd flags are set on
the 'struct file') if you start sharing fds then you can no longer use
the lookup scoping for security (a racing process could remove the
flags while the management process resolves through it).
James Morris Oct. 1, 2018, 8:14 p.m. UTC | #17
On Mon, 1 Oct 2018, Mickaël Salaün wrote:

> Another way to apply a security policy could be to tied it to a file
> descriptor, similarly to Capsicum, which could enable to create
> programmable (real) capabilities. This way, it would be possible to
> "wrap" a file descriptor with a Landlock program and use it with
> FD-based syscalls or pass it to other processes. This would not require
> changes to the FS subsystem, but only the Landlock LSM code. This isn't
> done yet but I plan to add this new way to restrict operations on file
> descriptors.

Very interesting!

This could possibly be an LSM which stacks/integrates with other LSMs to 
enforce MAC of object capabilities.



> 
> Anyway, for the use case you mentioned, the AT_BENEATH flag(s) should be
> simple to use and enough for now. We must be careful of the hardcoded
> policy though.
> 
> 
> > 
> >> On 9/29/18 12:34, Aleksa Sarai wrote:
> >>> The need for some sort of control over VFS's path resolution (to avoid
> >>> malicious paths resulting in inadvertent breakouts) has been a very
> >>> long-standing desire of many userspace applications. This patchset is a
> >>> revival of Al Viro's old AT_NO_JUMPS[1] patchset with a few additions.
> >>>
> >>> The most obvious change is that AT_NO_JUMPS has been split as dicussed
> >>> in the original thread, along with a further split of AT_NO_PROCLINKS
> >>> which means that each individual property of AT_NO_JUMPS is now a
> >>> separate flag:
> >>>
> >>>   * Path-based escapes from the starting-point using "/" or ".." are
> >>>     blocked by AT_BENEATH.
> >>>   * Mountpoint crossings are blocked by AT_XDEV.
> >>>   * /proc/$pid/fd/$fd resolution is blocked by AT_NO_PROCLINKS (more
> >>>       correctly it actually blocks any user of nd_jump_link() because it
> >>>       allows out-of-VFS path resolution manipulation).
> >>>
> >>> AT_NO_JUMPS is now effectively (AT_BENEATH|AT_XDEV|AT_NO_PROCLINKS). At
> >>> Linus' suggestion in the original thread, I've also implemented
> >>> AT_NO_SYMLINKS which just denies _all_ symlink resolution (including
> >>> "proclink" resolution).
> >>>
> >>> An additional improvement was made to AT_XDEV. The original AT_NO_JUMPS
> >>> path didn't consider "/tmp/.." as a mountpoint crossing -- this patch
> >>> blocks this as well (feel free to ask me to remove it if you feel this
> >>> is not sane).
> >>>
> >>> Currently I've only enabled these for openat(2) and the stat(2) family.
> >>> I would hope we could enable it for basically every *at(2) syscall --
> >>> but many of them appear to not have a @flags argument and thus we'll
> >>> need to add several new syscalls to do this. I'm more than happy to send
> >>> those patches, but I'd prefer to know that this preliminary work is
> >>> acceptable before doing a bunch of copy-paste to add new sets of *at(2)
> >>> syscalls.
> >>>
> >>> One additional feature I've implemented is AT_THIS_ROOT (I imagine this
> >>> is probably going to be more contentious than the refresh of
> >>> AT_NO_JUMPS, so I've included it in a separate patch). The patch itself
> >>> describes my reasoning, but the shortened version of the premise is that
> >>> continer runtimes need to have a way to resolve paths within a
> >>> potentially malicious rootfs. Container runtimes currently do this in
> >>> userspace[2] which has implicit race conditions that are not resolvable
> >>> in userspace (or use fork+exec+chroot and SCM_RIGHTS passing which is
> >>> inefficient). AT_THIS_ROOT allows for per-call chroot-like semantics for
> >>> path resolution, which would be invaluable for us -- and the
> >>> implementation is basically identical to AT_BENEATH (except that we
> >>> don't return errors when someone actually hits the root).
> >>>
> >>> I've added some selftests for this, but it's not clear to me whether
> >>> they should live here or in xfstests (as far as I can tell there are no
> >>> other VFS tests in selftests, while there are some tests that look like
> >>> generic VFS tests in xfstests). If you'd prefer them to be included in
> >>> xfstests, let me know.
> >>>
> >>> [1]: https://lore.kernel.org/patchwork/patch/784221/
> >>> [2]: https://github.com/cyphar/filepath-securejoin
> >>>
> >>> Aleksa Sarai (3):
> >>>   namei: implement O_BENEATH-style AT_* flags
> >>>   namei: implement AT_THIS_ROOT chroot-like path resolution
> >>>   selftests: vfs: add AT_* path resolution tests
> >>>
> >>>  fs/fcntl.c                                    |   2 +-
> >>>  fs/namei.c                                    | 158 ++++++++++++------
> >>>  fs/open.c                                     |  10 ++
> >>>  fs/stat.c                                     |  15 +-
> >>>  include/linux/fcntl.h                         |   3 +-
> >>>  include/linux/namei.h                         |   8 +
> >>>  include/uapi/asm-generic/fcntl.h              |  20 +++
> >>>  include/uapi/linux/fcntl.h                    |  10 ++
> >>>  tools/testing/selftests/Makefile              |   1 +
> >>>  tools/testing/selftests/vfs/.gitignore        |   1 +
> >>>  tools/testing/selftests/vfs/Makefile          |  13 ++
> >>>  tools/testing/selftests/vfs/at_flags.h        |  40 +++++
> >>>  tools/testing/selftests/vfs/common.sh         |  37 ++++
> >>>  .../selftests/vfs/tests/0001_at_beneath.sh    |  72 ++++++++
> >>>  .../selftests/vfs/tests/0002_at_xdev.sh       |  54 ++++++
> >>>  .../vfs/tests/0003_at_no_proclinks.sh         |  50 ++++++
> >>>  .../vfs/tests/0004_at_no_symlinks.sh          |  49 ++++++
> >>>  .../selftests/vfs/tests/0005_at_this_root.sh  |  66 ++++++++
> >>>  tools/testing/selftests/vfs/vfs_helper.c      | 154 +++++++++++++++++
> >>>  19 files changed, 707 insertions(+), 56 deletions(-)
> >>>  create mode 100644 tools/testing/selftests/vfs/.gitignore
> >>>  create mode 100644 tools/testing/selftests/vfs/Makefile
> >>>  create mode 100644 tools/testing/selftests/vfs/at_flags.h
> >>>  create mode 100644 tools/testing/selftests/vfs/common.sh
> >>>  create mode 100755 tools/testing/selftests/vfs/tests/0001_at_beneath.sh
> >>>  create mode 100755 tools/testing/selftests/vfs/tests/0002_at_xdev.sh
> >>>  create mode 100755 tools/testing/selftests/vfs/tests/0003_at_no_proclinks.sh
> >>>  create mode 100755 tools/testing/selftests/vfs/tests/0004_at_no_symlinks.sh
> >>>  create mode 100755 tools/testing/selftests/vfs/tests/0005_at_this_root.sh
> >>>  create mode 100644 tools/testing/selftests/vfs/vfs_helper.c
> >>>
> >>
> > 
> > 
> 
>
David Laight Oct. 3, 2018, 1:21 p.m. UTC | #18
From: Aleksa Sarai
> Sent: 01 October 2018 17:16
> 
> On 2018-10-01, David Laight <David.Laight@ACULAB.COM> wrote:
...
> > >   * Mountpoint crossings are blocked by AT_XDEV.
> >
> > You might want a mountpoint flag that allows crossing into the mounted
> > filesystem (you may need to get out in order to do pwd()).
> 
> Like a mount flag? I'm not sure how I feel about that. The intention is
> to allow for a process to have control over how path lookups are
> handled, and tying it to a mount flag means that it's no longer entirely
> up to the process.

Right, but you may have some mount points that you don't want to cross
and others that it is perfectly fine to cross.
For example you might want to be able to cross into a 'tmp' filesystem.

...
> > If you make the flags a property of the directory vnode (perhaps as
> > well as any syscall flags), and make it inherited by vnode lookup then
> > it can be used to stop library functions (or entire binaries) using
> > blocked paths.
> > You'd then only need to add an fcntl() call to set the flags (but never
> > clear them) to get the restriction applied to every lookup.
> 
> This seems like it might be useful, but it could always be done as a
> follow-up patch by just setting LOOKUP_BLAH if the dirfd has the flag
> set. I'm also a little bit concerned that (because fd flags are set on
> the 'struct file') if you start sharing fds then you can no longer use
> the lookup scoping for security (a racing process could remove the
> flags while the management process resolves through it).

I was thinking that the flags would never be removable.
A management process might have to flip its cwd back and forth
in order to clear the flags (opendir(".") should give a different
struct file).

This all gets tied up with the slight requirement for per-thread cwd.

I had another thought that the crudentials structure used for a file
lookup could also be taken from the cwd (not sure how it would
get there - especially if you need the correct group list).
That would allow a 'management' process to open a file in the context
of the target user process.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)