diff mbox series

[v1] memfd: `MFD_NOEXEC_SEAL` should not imply `MFD_ALLOW_SEALING`

Message ID 20240513191544.94754-1-pobrn@protonmail.com (mailing list archive)
State New
Headers show
Series [v1] memfd: `MFD_NOEXEC_SEAL` should not imply `MFD_ALLOW_SEALING` | expand

Commit Message

Barnabás Pőcze May 13, 2024, 7:15 p.m. UTC
`MFD_NOEXEC_SEAL` should remove the executable bits and set
`F_SEAL_EXEC` to prevent further modifications to the executable
bits as per the comment in the uapi header file:

  not executable and sealed to prevent changing to executable

However, currently, it also unsets `F_SEAL_SEAL`, essentially
acting as a superset of `MFD_ALLOW_SEALING`. Nothing implies
that it should be so, and indeed up until the second version
of the of the patchset[0] that introduced `MFD_EXEC` and
`MFD_NOEXEC_SEAL`, `F_SEAL_SEAL` was not removed, however it
was changed in the third revision of the patchset[1] without
a clear explanation.

This behaviour is suprising for application developers,
there is no documentation that would reveal that `MFD_NOEXEC_SEAL`
has the additional effect of `MFD_ALLOW_SEALING`.

So do not remove `F_SEAL_SEAL` when `MFD_NOEXEC_SEAL` is requested.
This is technically an ABI break, but it seems very unlikely that an
application would depend on this behaviour (unless by accident).

[0]: https://lore.kernel.org/lkml/20220805222126.142525-3-jeffxu@google.com/
[1]: https://lore.kernel.org/lkml/20221202013404.163143-3-jeffxu@google.com/

Fixes: 105ff5339f498a ("mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC")
Signed-off-by: Barnabás Pőcze <pobrn@protonmail.com>
---

Or did I miss the explanation as to why MFD_NOEXEC_SEAL should
imply MFD_ALLOW_SEALING? If so, please direct me to it and
sorry for the noise.

---
 mm/memfd.c                                 | 9 ++++-----
 tools/testing/selftests/memfd/memfd_test.c | 2 +-
 2 files changed, 5 insertions(+), 6 deletions(-)

Comments

Jeff Xu May 16, 2024, 6:11 a.m. UTC | #1
On Mon, May 13, 2024 at 12:15 PM Barnabás Pőcze <pobrn@protonmail.com> wrote:
>
> `MFD_NOEXEC_SEAL` should remove the executable bits and set
> `F_SEAL_EXEC` to prevent further modifications to the executable
> bits as per the comment in the uapi header file:
>
>   not executable and sealed to prevent changing to executable
>
> However, currently, it also unsets `F_SEAL_SEAL`, essentially
> acting as a superset of `MFD_ALLOW_SEALING`. Nothing implies
> that it should be so, and indeed up until the second version
> of the of the patchset[0] that introduced `MFD_EXEC` and
> `MFD_NOEXEC_SEAL`, `F_SEAL_SEAL` was not removed, however it
> was changed in the third revision of the patchset[1] without
> a clear explanation.
>
> This behaviour is suprising for application developers,
> there is no documentation that would reveal that `MFD_NOEXEC_SEAL`
> has the additional effect of `MFD_ALLOW_SEALING`.
>
Ya, I agree that there should be documentation, such as a man page. I will
work on that.

> So do not remove `F_SEAL_SEAL` when `MFD_NOEXEC_SEAL` is requested.
> This is technically an ABI break, but it seems very unlikely that an
> application would depend on this behaviour (unless by accident).
>
> [0]: https://lore.kernel.org/lkml/20220805222126.142525-3-jeffxu@google.com/
> [1]: https://lore.kernel.org/lkml/20221202013404.163143-3-jeffxu@google.com/
>
> Fixes: 105ff5339f498a ("mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC")
> Signed-off-by: Barnabás Pőcze <pobrn@protonmail.com>
> ---
>
> Or did I miss the explanation as to why MFD_NOEXEC_SEAL should
> imply MFD_ALLOW_SEALING? If so, please direct me to it and
> sorry for the noise.
>
Previously I might be thinking  MFD_NOEXEC_SEAL implies
MFD_ALLOW_SEALING because MFD_NOEXEC_SEAL seals F_SEAL_EXEC, and
sealing is added only when MFD_ALLOW_SEALING is set.

I agree your patch handles this better, e.g.
mfd_create(MFD_NOEXEC_SEAL) will have F_SEAL_SEAL and F_SEAL_EXEC
mfd_create(MFD_NOEXEC_SEAL|MFD_ALLOW_SEALING) will have F_SEAL_EXEC


> ---
>  mm/memfd.c                                 | 9 ++++-----
>  tools/testing/selftests/memfd/memfd_test.c | 2 +-
>  2 files changed, 5 insertions(+), 6 deletions(-)
>
> diff --git a/mm/memfd.c b/mm/memfd.c
> index 7d8d3ab3fa37..8b7f6afee21d 100644
> --- a/mm/memfd.c
> +++ b/mm/memfd.c
> @@ -356,12 +356,11 @@ SYSCALL_DEFINE2(memfd_create,
>
>                 inode->i_mode &= ~0111;
>                 file_seals = memfd_file_seals_ptr(file);
> -               if (file_seals) {
> -                       *file_seals &= ~F_SEAL_SEAL;
> +               if (file_seals)
>                         *file_seals |= F_SEAL_EXEC;
> -               }
> -       } else if (flags & MFD_ALLOW_SEALING) {
> -               /* MFD_EXEC and MFD_ALLOW_SEALING are set */
> +       }
> +
> +       if (flags & MFD_ALLOW_SEALING) {
>                 file_seals = memfd_file_seals_ptr(file);
>                 if (file_seals)
>                         *file_seals &= ~F_SEAL_SEAL;
> diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c
> index 18f585684e20..b6a7ad68c3c1 100644
> --- a/tools/testing/selftests/memfd/memfd_test.c
> +++ b/tools/testing/selftests/memfd/memfd_test.c
> @@ -1151,7 +1151,7 @@ static void test_noexec_seal(void)
>                             mfd_def_size,
>                             MFD_CLOEXEC | MFD_NOEXEC_SEAL);
>         mfd_assert_mode(fd, 0666);
> -       mfd_assert_has_seals(fd, F_SEAL_EXEC);
> +       mfd_assert_has_seals(fd, F_SEAL_SEAL | F_SEAL_EXEC);
>         mfd_fail_chmod(fd, 0777);
>         close(fd);
>  }
> --
> 2.45.0
>

Reviewed-by: Jeff Xu <jeffxu@google.com>

Thanks!
-Jeff
Andrew Morton May 22, 2024, 11:23 p.m. UTC | #2
On Wed, 15 May 2024 23:11:12 -0700 Jeff Xu <jeffxu@google.com> wrote:

> On Mon, May 13, 2024 at 12:15 PM Barnabás Pőcze <pobrn@protonmail.com> wrote:
> >
> > `MFD_NOEXEC_SEAL` should remove the executable bits and set
> > `F_SEAL_EXEC` to prevent further modifications to the executable
> > bits as per the comment in the uapi header file:
> >
> >   not executable and sealed to prevent changing to executable
> >
> > However, currently, it also unsets `F_SEAL_SEAL`, essentially
> > acting as a superset of `MFD_ALLOW_SEALING`. Nothing implies
> > that it should be so, and indeed up until the second version
> > of the of the patchset[0] that introduced `MFD_EXEC` and
> > `MFD_NOEXEC_SEAL`, `F_SEAL_SEAL` was not removed, however it
> > was changed in the third revision of the patchset[1] without
> > a clear explanation.
> >
> > This behaviour is suprising for application developers,
> > there is no documentation that would reveal that `MFD_NOEXEC_SEAL`
> > has the additional effect of `MFD_ALLOW_SEALING`.
> >
> Ya, I agree that there should be documentation, such as a man page. I will
> work on that.
> 
> > So do not remove `F_SEAL_SEAL` when `MFD_NOEXEC_SEAL` is requested.
> > This is technically an ABI break, but it seems very unlikely that an
> > application would depend on this behaviour (unless by accident).
> >
> > [0]: https://lore.kernel.org/lkml/20220805222126.142525-3-jeffxu@google.com/
> > [1]: https://lore.kernel.org/lkml/20221202013404.163143-3-jeffxu@google.com/
> 
> ...
>
> Reviewed-by: Jeff Xu <jeffxu@google.com>

It's a change to a userspace API, yes?  Please let's have a detailed
description of why this is OK.  Why it won't affect any existing users.

Also, please let's give consideration to a -stable backport so that all
kernel versions will eventually behave in the same manner.
Barnabás Pőcze May 23, 2024, 2:25 a.m. UTC | #3
Hi


2024. május 23., csütörtök 1:23 keltezéssel, Andrew Morton <akpm@linux-foundation.org> írta:

> On Wed, 15 May 2024 23:11:12 -0700 Jeff Xu <jeffxu@google.com> wrote:
> 
> > On Mon, May 13, 2024 at 12:15 PM Barnabás Pőcze <pobrn@protonmail.com> wrote:
> > >
> > > `MFD_NOEXEC_SEAL` should remove the executable bits and set
> > > `F_SEAL_EXEC` to prevent further modifications to the executable
> > > bits as per the comment in the uapi header file:
> > >
> > >   not executable and sealed to prevent changing to executable
> > >
> > > However, currently, it also unsets `F_SEAL_SEAL`, essentially
> > > acting as a superset of `MFD_ALLOW_SEALING`. Nothing implies
> > > that it should be so, and indeed up until the second version
> > > of the of the patchset[0] that introduced `MFD_EXEC` and
> > > `MFD_NOEXEC_SEAL`, `F_SEAL_SEAL` was not removed, however it
> > > was changed in the third revision of the patchset[1] without
> > > a clear explanation.
> > >
> > > This behaviour is suprising for application developers,
> > > there is no documentation that would reveal that `MFD_NOEXEC_SEAL`
> > > has the additional effect of `MFD_ALLOW_SEALING`.
> > >
> > Ya, I agree that there should be documentation, such as a man page. I will
> > work on that.
> >
> > > So do not remove `F_SEAL_SEAL` when `MFD_NOEXEC_SEAL` is requested.
> > > This is technically an ABI break, but it seems very unlikely that an
> > > application would depend on this behaviour (unless by accident).
> > >
> > > [0]: https://lore.kernel.org/lkml/20220805222126.142525-3-jeffxu@google.com/
> > > [1]: https://lore.kernel.org/lkml/20221202013404.163143-3-jeffxu@google.com/
> >
> > ...
> >
> > Reviewed-by: Jeff Xu <jeffxu@google.com>
> 
> It's a change to a userspace API, yes?  Please let's have a detailed
> description of why this is OK.  Why it won't affect any existing users.

Yes, it is a uAPI change. To trigger user visible change, a program has to

 - create a memfd
   - with MFD_NOEXEC_SEAL,
   - without MFD_ALLOW_SEALING;
 - try to add seals / check the seals.

This change in essence reverts the kernel's behaviour to that of Linux <6.3, where
only `MFD_ALLOW_SEALING` enabled sealing. If a program works correctly on those
kernels, it will likely work correctly after this change.

I have looked through Debian Code Search and GitHub, searching for `MFD_NOEXEC_SEAL`.
And I could find only a single breakage that this change would case: dbus-broker
has its own memfd_create() wrapper that is aware of this implicit `MFD_ALLOW_SEALING`
behaviour[0], and tries to work around it. This workaround will break. Luckily,
however, as far as I could tell this only affects the test suite of dbus-broker,
not its normal operations, so I believe it should be fine. I have prepared a PR
with a fix[1].


> 
> Also, please let's give consideration to a -stable backport so that all
> kernel versions will eventually behave in the same manner.
> 
> 

I think that is a good idea, should I resend this with the `Cc: stable@...` tag or
what should I do?


Regards,
Barnabás Pőcze


[0]: https://github.com/bus1/dbus-broker/blob/9eb0b7e5826fc76cad7b025bc46f267d4a8784cb/src/util/misc.c#L114
[1]: https://github.com/bus1/dbus-broker/pull/366
Jeff Xu May 23, 2024, 2:32 a.m. UTC | #4
On Wed, May 22, 2024 at 4:23 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Wed, 15 May 2024 23:11:12 -0700 Jeff Xu <jeffxu@google.com> wrote:
>
> > On Mon, May 13, 2024 at 12:15 PM Barnabás Pőcze <pobrn@protonmail.com> wrote:
> > >
> > > `MFD_NOEXEC_SEAL` should remove the executable bits and set
> > > `F_SEAL_EXEC` to prevent further modifications to the executable
> > > bits as per the comment in the uapi header file:
> > >
> > >   not executable and sealed to prevent changing to executable
> > >
> > > However, currently, it also unsets `F_SEAL_SEAL`, essentially
> > > acting as a superset of `MFD_ALLOW_SEALING`. Nothing implies
> > > that it should be so, and indeed up until the second version
> > > of the of the patchset[0] that introduced `MFD_EXEC` and
> > > `MFD_NOEXEC_SEAL`, `F_SEAL_SEAL` was not removed, however it
> > > was changed in the third revision of the patchset[1] without
> > > a clear explanation.
> > >
> > > This behaviour is suprising for application developers,
> > > there is no documentation that would reveal that `MFD_NOEXEC_SEAL`
> > > has the additional effect of `MFD_ALLOW_SEALING`.
> > >
> > Ya, I agree that there should be documentation, such as a man page. I will
> > work on that.
> >
> > > So do not remove `F_SEAL_SEAL` when `MFD_NOEXEC_SEAL` is requested.
> > > This is technically an ABI break, but it seems very unlikely that an
> > > application would depend on this behaviour (unless by accident).
> > >
> > > [0]: https://lore.kernel.org/lkml/20220805222126.142525-3-jeffxu@google.com/
> > > [1]: https://lore.kernel.org/lkml/20221202013404.163143-3-jeffxu@google.com/
> >
> > ...
> >
> > Reviewed-by: Jeff Xu <jeffxu@google.com>
>
> It's a change to a userspace API, yes?  Please let's have a detailed
> description of why this is OK.  Why it won't affect any existing users.
>
Unfortunately, this is a breaking change that might break a
application if they do below:
memfd_create("", MFD_NOEXEC_SEAL)
fcntl(fd, F_ADD_SEALS, F_SEAL_WRITE); <-- this will fail in new
semantics, due to mfd not being sealable.

However, I still think the new semantics is a better, the reason is
due to  the sysctl: memfd_noexec_scope
Currently, when the sysctl  is set to MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL
kernel adds MFD_NOEXEC_SEAL to memfd_create, and the memfd  becomes sealable.
E.g.
When the sysctl is set to MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL
The app calls memfd_create("",0)
application will get sealable memfd, which might be a surprise to application.

If the app doesn't want this behavior, they will need one of two below
in current implementation.
1>
set the sysctl: memfd_noexec_scope to 0.
So the kernel doesn't overwrite the mdmfd_create

2>
modify their code  to get non-sealable NOEXEC memfd.
memfd_create("", MEMFD_NOEXEC_SCOPE_NOEXEC)
fcntl(fd, F_ADD_SEALS, F_SEAL_SEAL)

The new semantics works better with the sysctl.

Since memfd noexec is new, maybe there is no application using the
MFD_NOEXEC_SEAL to create
sealable memfd. They mostly likely use
memfd(MFD_NOEXEC_SEAL|MFD_ALLOW_SEALING) instead.
I think it might benefit in the long term with the new semantics.

If breaking change is not recommended,  the alternative is to
introduce a new flag.
MFD_NOEXEC_SEAL_SEAL. (I can't find a better name...)

What do you think ?

> Also, please let's give consideration to a -stable backport so that all
> kernel versions will eventually behave in the same manner.
>
Yes. If the new semantics is acceptable, backport is needed as bugfix
to all kernel versions.
I can do that if someone helps me with the process.

And sorry about this bug that I created.
Jeff Xu May 23, 2024, 2:40 a.m. UTC | #5
On Wed, May 22, 2024 at 7:25 PM Barnabás Pőcze <pobrn@protonmail.com> wrote:
>
> Hi
>
>
> 2024. május 23., csütörtök 1:23 keltezéssel, Andrew Morton <akpm@linux-foundation.org> írta:
>
> > On Wed, 15 May 2024 23:11:12 -0700 Jeff Xu <jeffxu@google.com> wrote:
> >
> > > On Mon, May 13, 2024 at 12:15 PM Barnabás Pőcze <pobrn@protonmail.com> wrote:
> > > >
> > > > `MFD_NOEXEC_SEAL` should remove the executable bits and set
> > > > `F_SEAL_EXEC` to prevent further modifications to the executable
> > > > bits as per the comment in the uapi header file:
> > > >
> > > >   not executable and sealed to prevent changing to executable
> > > >
> > > > However, currently, it also unsets `F_SEAL_SEAL`, essentially
> > > > acting as a superset of `MFD_ALLOW_SEALING`. Nothing implies
> > > > that it should be so, and indeed up until the second version
> > > > of the of the patchset[0] that introduced `MFD_EXEC` and
> > > > `MFD_NOEXEC_SEAL`, `F_SEAL_SEAL` was not removed, however it
> > > > was changed in the third revision of the patchset[1] without
> > > > a clear explanation.
> > > >
> > > > This behaviour is suprising for application developers,
> > > > there is no documentation that would reveal that `MFD_NOEXEC_SEAL`
> > > > has the additional effect of `MFD_ALLOW_SEALING`.
> > > >
> > > Ya, I agree that there should be documentation, such as a man page. I will
> > > work on that.
> > >
> > > > So do not remove `F_SEAL_SEAL` when `MFD_NOEXEC_SEAL` is requested.
> > > > This is technically an ABI break, but it seems very unlikely that an
> > > > application would depend on this behaviour (unless by accident).
> > > >
> > > > [0]: https://lore.kernel.org/lkml/20220805222126.142525-3-jeffxu@google.com/
> > > > [1]: https://lore.kernel.org/lkml/20221202013404.163143-3-jeffxu@google.com/
> > >
> > > ...
> > >
> > > Reviewed-by: Jeff Xu <jeffxu@google.com>
> >
> > It's a change to a userspace API, yes?  Please let's have a detailed
> > description of why this is OK.  Why it won't affect any existing users.
>
> Yes, it is a uAPI change. To trigger user visible change, a program has to
>
>  - create a memfd
>    - with MFD_NOEXEC_SEAL,
>    - without MFD_ALLOW_SEALING;
>  - try to add seals / check the seals.
>
> This change in essence reverts the kernel's behaviour to that of Linux <6.3, where
> only `MFD_ALLOW_SEALING` enabled sealing. If a program works correctly on those
> kernels, it will likely work correctly after this change.
>
I agree with this.

The current memfd_test.c doesn't have good coverage sealable vs not_seable,
most tests are created with MFD_ALLOW_SEALING
I think the test_sysctl_set_sysctl0/1/2 need to add  cases for
no-sealable memfd.
because the change will also change the behavior of  the sysctl.
Do you want to add them as part of the patch ?


> I have looked through Debian Code Search and GitHub, searching for `MFD_NOEXEC_SEAL`.
> And I could find only a single breakage that this change would case: dbus-broker
> has its own memfd_create() wrapper that is aware of this implicit `MFD_ALLOW_SEALING`
> behaviour[0], and tries to work around it. This workaround will break. Luckily,
> however, as far as I could tell this only affects the test suite of dbus-broker,
> not its normal operations, so I believe it should be fine. I have prepared a PR
> with a fix[1].
>
Thanks for the investigation.

>
> >
> > Also, please let's give consideration to a -stable backport so that all
> > kernel versions will eventually behave in the same manner.
> >
> >
>
> I think that is a good idea, should I resend this with the `Cc: stable@...` tag or
> what should I do?
>
>
> Regards,
> Barnabás Pőcze
>
>
> [0]: https://github.com/bus1/dbus-broker/blob/9eb0b7e5826fc76cad7b025bc46f267d4a8784cb/src/util/misc.c#L114
> [1]: https://github.com/bus1/dbus-broker/pull/366
David Rheinsberg May 23, 2024, 8:24 a.m. UTC | #6
Hi

On Thu, May 23, 2024, at 4:25 AM, Barnabás Pőcze wrote:
> 2024. május 23., csütörtök 1:23 keltezéssel, Andrew Morton 
> <akpm@linux-foundation.org> írta:
>> It's a change to a userspace API, yes?  Please let's have a detailed
>> description of why this is OK.  Why it won't affect any existing users.
>
> Yes, it is a uAPI change. To trigger user visible change, a program has to
>
>  - create a memfd
>    - with MFD_NOEXEC_SEAL,
>    - without MFD_ALLOW_SEALING;
>  - try to add seals / check the seals.
>
> This change in essence reverts the kernel's behaviour to that of Linux 
> <6.3, where
> only `MFD_ALLOW_SEALING` enabled sealing. If a program works correctly 
> on those
> kernels, it will likely work correctly after this change.
>
> I have looked through Debian Code Search and GitHub, searching for 
> `MFD_NOEXEC_SEAL`.
> And I could find only a single breakage that this change would case: 
> dbus-broker
> has its own memfd_create() wrapper that is aware of this implicit 
> `MFD_ALLOW_SEALING`
> behaviour[0], and tries to work around it. This workaround will break. 
> Luckily,
> however, as far as I could tell this only affects the test suite of 
> dbus-broker,
> not its normal operations, so I believe it should be fine. I have 
> prepared a PR
> with a fix[1].

We asked for exactly this fix before, so I very much support this. Our test-suite in `dbus-broker` merely verifies what the current kernel behavior is (just like the kernel selftests). I am certainly ok if the kernel breaks it. I will gladly adapt the test-suite.

Previous discussion was in:

    [PATCH] memfd: support MFD_NOEXEC alongside MFD_EXEC
    https://lore.kernel.org/lkml/20230714114753.170814-1-david@readahead.eu/

Note that this fix is particularly important in combination with `vm.memfd_noexec=2`, since this breaks existing user-space by enabling sealing on all memfds unconditionally. I also encourage backporting to stable kernels.

Reviewed-by: David Rheinsberg <david@readahead.eu>

Thanks
David
Jeff Xu May 23, 2024, 4:20 p.m. UTC | #7
On Thu, May 23, 2024 at 1:24 AM David Rheinsberg <david@readahead.eu> wrote:
>
> Hi
>
> On Thu, May 23, 2024, at 4:25 AM, Barnabás Pőcze wrote:
> > 2024. május 23., csütörtök 1:23 keltezéssel, Andrew Morton
> > <akpm@linux-foundation.org> írta:
> >> It's a change to a userspace API, yes?  Please let's have a detailed
> >> description of why this is OK.  Why it won't affect any existing users.
> >
> > Yes, it is a uAPI change. To trigger user visible change, a program has to
> >
> >  - create a memfd
> >    - with MFD_NOEXEC_SEAL,
> >    - without MFD_ALLOW_SEALING;
> >  - try to add seals / check the seals.
> >
> > This change in essence reverts the kernel's behaviour to that of Linux
> > <6.3, where
> > only `MFD_ALLOW_SEALING` enabled sealing. If a program works correctly
> > on those
> > kernels, it will likely work correctly after this change.
> >
> > I have looked through Debian Code Search and GitHub, searching for
> > `MFD_NOEXEC_SEAL`.
> > And I could find only a single breakage that this change would case:
> > dbus-broker
> > has its own memfd_create() wrapper that is aware of this implicit
> > `MFD_ALLOW_SEALING`
> > behaviour[0], and tries to work around it. This workaround will break.
> > Luckily,
> > however, as far as I could tell this only affects the test suite of
> > dbus-broker,
> > not its normal operations, so I believe it should be fine. I have
> > prepared a PR
> > with a fix[1].
>
> We asked for exactly this fix before, so I very much support this. Our test-suite in `dbus-broker` merely verifies what the current kernel behavior is (just like the kernel selftests). I am certainly ok if the kernel breaks it. I will gladly adapt the test-suite.
>
> Previous discussion was in:
>
>     [PATCH] memfd: support MFD_NOEXEC alongside MFD_EXEC
>     https://lore.kernel.org/lkml/20230714114753.170814-1-david@readahead.eu/
>
> Note that this fix is particularly important in combination with `vm.memfd_noexec=2`, since this breaks existing user-space by enabling sealing on all memfds unconditionally. I also encourage backporting to stable kernels.
>
Also with vm.memfd_noexec=1.
I think that problem must be addressed either with this patch, or with
a new flag.

Regarding vm.memfd_noexec, on another topic.
I think in addition to  vm.memfd_noexec = 1 and 2,  there still could
be another state: 3

=0. Do nothing.
=1. This will add MFD_NOEXEC_SEAL if application didn't set EXEC or
MFD_NOEXE_SEAL (to help with the migration)
=2: This will reject all calls without MFD_NOEXEC_SEAL (the whole
system doesn't allow executable memfd)
=3:  Application must set MFD_EXEC or MFD_NOEXEC_SEAL explicitly, or
else it will be rejected.

3 is useful because it lets applications choose what to use, and
forces applications to migrate to new semantics (this is what 2 did
before 9876cfe8).
The caveat is 3 is less restrictive than 2, so must document it clearly.

-Jeff

> Reviewed-by: David Rheinsberg <david@readahead.eu>
>
> Thanks
> David
Jeff Xu May 23, 2024, 4:55 p.m. UTC | #8
On Thu, May 23, 2024 at 9:20 AM Jeff Xu <jeffxu@google.com> wrote:
>
> On Thu, May 23, 2024 at 1:24 AM David Rheinsberg <david@readahead.eu> wrote:
> >
> > Hi
> >
> > On Thu, May 23, 2024, at 4:25 AM, Barnabás Pőcze wrote:
> > > 2024. május 23., csütörtök 1:23 keltezéssel, Andrew Morton
> > > <akpm@linux-foundation.org> írta:
> > >> It's a change to a userspace API, yes?  Please let's have a detailed
> > >> description of why this is OK.  Why it won't affect any existing users.
> > >
> > > Yes, it is a uAPI change. To trigger user visible change, a program has to
> > >
> > >  - create a memfd
> > >    - with MFD_NOEXEC_SEAL,
> > >    - without MFD_ALLOW_SEALING;
> > >  - try to add seals / check the seals.
> > >
> > > This change in essence reverts the kernel's behaviour to that of Linux
> > > <6.3, where
> > > only `MFD_ALLOW_SEALING` enabled sealing. If a program works correctly
> > > on those
> > > kernels, it will likely work correctly after this change.
> > >
> > > I have looked through Debian Code Search and GitHub, searching for
> > > `MFD_NOEXEC_SEAL`.
> > > And I could find only a single breakage that this change would case:
> > > dbus-broker
> > > has its own memfd_create() wrapper that is aware of this implicit
> > > `MFD_ALLOW_SEALING`
> > > behaviour[0], and tries to work around it. This workaround will break.
> > > Luckily,
> > > however, as far as I could tell this only affects the test suite of
> > > dbus-broker,
> > > not its normal operations, so I believe it should be fine. I have
> > > prepared a PR
> > > with a fix[1].
> >
> > We asked for exactly this fix before, so I very much support this. Our test-suite in `dbus-broker` merely verifies what the current kernel behavior is (just like the kernel selftests). I am certainly ok if the kernel breaks it. I will gladly adapt the test-suite.
> >
memfd is by default not sealable, and file is by default sealable,
right ? that makes the memfd  semantics different from other objects
in linux.
I wonder what is the original reason to have memfd  this way?

Another solution is to change memfd to be by-default sealable,
although that will be an api change, but what side effect  will it be
?
If we are worried about the memfd being sealed by an attacker, the
malicious code could also overwrite the content since memfd is not
sealed.

> > Previous discussion was in:
> >
> >     [PATCH] memfd: support MFD_NOEXEC alongside MFD_EXEC
> >     https://lore.kernel.org/lkml/20230714114753.170814-1-david@readahead.eu/
> >
> > Note that this fix is particularly important in combination with `vm.memfd_noexec=2`, since this breaks existing user-space by enabling sealing on all memfds unconditionally. I also encourage backporting to stable kernels.
> >
> Also with vm.memfd_noexec=1.
> I think that problem must be addressed either with this patch, or with
> a new flag.
>
> Regarding vm.memfd_noexec, on another topic.
> I think in addition to  vm.memfd_noexec = 1 and 2,  there still could
> be another state: 3
>
> =0. Do nothing.
> =1. This will add MFD_NOEXEC_SEAL if application didn't set EXEC or
> MFD_NOEXE_SEAL (to help with the migration)
> =2: This will reject all calls without MFD_NOEXEC_SEAL (the whole
> system doesn't allow executable memfd)
> =3:  Application must set MFD_EXEC or MFD_NOEXEC_SEAL explicitly, or
> else it will be rejected.
>
> 3 is useful because it lets applications choose what to use, and
> forces applications to migrate to new semantics (this is what 2 did
> before 9876cfe8).
> The caveat is 3 is less restrictive than 2, so must document it clearly.
>
> -Jeff
>
> > Reviewed-by: David Rheinsberg <david@readahead.eu>
> >
> > Thanks
> > David
Andrew Morton May 23, 2024, 7:45 p.m. UTC | #9
On Wed, 22 May 2024 19:32:35 -0700 Jeff Xu <jeffxu@google.com> wrote:

> >
> > It's a change to a userspace API, yes?  Please let's have a detailed
> > description of why this is OK.  Why it won't affect any existing users.
> >
> Unfortunately, this is a breaking change that might break a
> application if they do below:
> memfd_create("", MFD_NOEXEC_SEAL)
> fcntl(fd, F_ADD_SEALS, F_SEAL_WRITE); <-- this will fail in new
> semantics, due to mfd not being sealable.
> 
> However, I still think the new semantics is a better, the reason is
> due to  the sysctl: memfd_noexec_scope
> Currently, when the sysctl  is set to MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL
> kernel adds MFD_NOEXEC_SEAL to memfd_create, and the memfd  becomes sealable.
> E.g.
> When the sysctl is set to MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL
> The app calls memfd_create("",0)
> application will get sealable memfd, which might be a surprise to application.
> 
> If the app doesn't want this behavior, they will need one of two below
> in current implementation.
> 1>
> set the sysctl: memfd_noexec_scope to 0.
> So the kernel doesn't overwrite the mdmfd_create
> 
> 2>
> modify their code  to get non-sealable NOEXEC memfd.
> memfd_create("", MEMFD_NOEXEC_SCOPE_NOEXEC)
> fcntl(fd, F_ADD_SEALS, F_SEAL_SEAL)
> 
> The new semantics works better with the sysctl.
> 
> Since memfd noexec is new, maybe there is no application using the
> MFD_NOEXEC_SEAL to create
> sealable memfd. They mostly likely use
> memfd(MFD_NOEXEC_SEAL|MFD_ALLOW_SEALING) instead.
> I think it might benefit in the long term with the new semantics.

Yes, it's new so I expect any damage will be small.  Please prepare a
v2 which fully explains/justifies the thinking for this
non-backward-compatible change and which include the cc:stable.
Jeff Xu May 23, 2024, 8:44 p.m. UTC | #10
Hi Barnabás

Is that OK that I work on V2 ? It will be based on your V1 change and
I will also add more test cases.

Thanks
-Jeff

-

On Thu, May 23, 2024 at 12:45 PM Andrew Morton
<akpm@linux-foundation.org> wrote:
>
> On Wed, 22 May 2024 19:32:35 -0700 Jeff Xu <jeffxu@google.com> wrote:
>
> > >
> > > It's a change to a userspace API, yes?  Please let's have a detailed
> > > description of why this is OK.  Why it won't affect any existing users.
> > >
> > Unfortunately, this is a breaking change that might break a
> > application if they do below:
> > memfd_create("", MFD_NOEXEC_SEAL)
> > fcntl(fd, F_ADD_SEALS, F_SEAL_WRITE); <-- this will fail in new
> > semantics, due to mfd not being sealable.
> >
> > However, I still think the new semantics is a better, the reason is
> > due to  the sysctl: memfd_noexec_scope
> > Currently, when the sysctl  is set to MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL
> > kernel adds MFD_NOEXEC_SEAL to memfd_create, and the memfd  becomes sealable.
> > E.g.
> > When the sysctl is set to MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL
> > The app calls memfd_create("",0)
> > application will get sealable memfd, which might be a surprise to application.
> >
> > If the app doesn't want this behavior, they will need one of two below
> > in current implementation.
> > 1>
> > set the sysctl: memfd_noexec_scope to 0.
> > So the kernel doesn't overwrite the mdmfd_create
> >
> > 2>
> > modify their code  to get non-sealable NOEXEC memfd.
> > memfd_create("", MEMFD_NOEXEC_SCOPE_NOEXEC)
> > fcntl(fd, F_ADD_SEALS, F_SEAL_SEAL)
> >
> > The new semantics works better with the sysctl.
> >
> > Since memfd noexec is new, maybe there is no application using the
> > MFD_NOEXEC_SEAL to create
> > sealable memfd. They mostly likely use
> > memfd(MFD_NOEXEC_SEAL|MFD_ALLOW_SEALING) instead.
> > I think it might benefit in the long term with the new semantics.
>
> Yes, it's new so I expect any damage will be small.  Please prepare a
> v2 which fully explains/justifies the thinking for this
> non-backward-compatible change and which include the cc:stable.
>
>
Barnabás Pőcze May 23, 2024, 8:50 p.m. UTC | #11
2024. május 23., csütörtök 22:44 keltezéssel, Jeff Xu <jeffxu@google.com> írta:

> Hi Barnabás
> 
> Is that OK that I work on V2 ? It will be based on your V1 change and
> I will also add more test cases.

Sure, please go ahead. At the very end of this letter you'll find
the commit message that I would have sent in v2, maybe you can salvage
some of it.


Regards,
Barnabás Pőcze


> 
> Thanks
> -Jeff
> 
> -
> 
> On Thu, May 23, 2024 at 12:45 PM Andrew Morton
> <akpm@linux-foundation.org> wrote:
> >
> > On Wed, 22 May 2024 19:32:35 -0700 Jeff Xu <jeffxu@google.com> wrote:
> >
> > > >
> > > > It's a change to a userspace API, yes?  Please let's have a detailed
> > > > description of why this is OK.  Why it won't affect any existing users.
> > > >
> > > Unfortunately, this is a breaking change that might break a
> > > application if they do below:
> > > memfd_create("", MFD_NOEXEC_SEAL)
> > > fcntl(fd, F_ADD_SEALS, F_SEAL_WRITE); <-- this will fail in new
> > > semantics, due to mfd not being sealable.
> > >
> > > However, I still think the new semantics is a better, the reason is
> > > due to  the sysctl: memfd_noexec_scope
> > > Currently, when the sysctl  is set to MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL
> > > kernel adds MFD_NOEXEC_SEAL to memfd_create, and the memfd  becomes sealable.
> > > E.g.
> > > When the sysctl is set to MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL
> > > The app calls memfd_create("",0)
> > > application will get sealable memfd, which might be a surprise to application.
> > >
> > > If the app doesn't want this behavior, they will need one of two below
> > > in current implementation.
> > > 1>
> > > set the sysctl: memfd_noexec_scope to 0.
> > > So the kernel doesn't overwrite the mdmfd_create
> > >
> > > 2>
> > > modify their code  to get non-sealable NOEXEC memfd.
> > > memfd_create("", MEMFD_NOEXEC_SCOPE_NOEXEC)
> > > fcntl(fd, F_ADD_SEALS, F_SEAL_SEAL)
> > >
> > > The new semantics works better with the sysctl.
> > >
> > > Since memfd noexec is new, maybe there is no application using the
> > > MFD_NOEXEC_SEAL to create
> > > sealable memfd. They mostly likely use
> > > memfd(MFD_NOEXEC_SEAL|MFD_ALLOW_SEALING) instead.
> > > I think it might benefit in the long term with the new semantics.
> >
> > Yes, it's new so I expect any damage will be small.  Please prepare a
> > v2 which fully explains/justifies the thinking for this
> > non-backward-compatible change and which include the cc:stable.
> >
> >
> 

---

memfd: `MFD_NOEXEC_SEAL` should not imply `MFD_ALLOW_SEALING`

`MFD_NOEXEC_SEAL` should remove the executable bits and
set `F_SEAL_EXEC` to prevent further modifications to the
executable bits as per the comment in the uapi header file:

  not executable and sealed to prevent changing to executable

However, currently, it also unsets `F_SEAL_SEAL`, essentially
acting as a superset of `MFD_ALLOW_SEALING`. Nothing implies that
it should be so, and indeed up until the second version of the of
the patchset[0] that introduced `MFD_EXEC` and `MFD_NOEXEC_SEAL`,
`F_SEAL_SEAL` was not removed, however it was changed in the
third revision of the patchset[1] without a clear explanation.

This behaviour is surprising for application developers, there
is no documentation that would reveal that `MFD_NOEXEC_SEAL`
has the additional effect of `MFD_ALLOW_SEALING`.
Additionally, combined with `vm.memfd_noexec=2` it has
the effect of making all memfds initially sealable.

So do not remove `F_SEAL_SEAL` when `MFD_NOEXEC_SEAL` is
requested, thereby returning to the pre-Linux 6.3 behaviour of
only allowing sealing when `MFD_ALLOW_SEALING` is specified.

Now, this is technically a uAPI break. However, the
damage is expected to be minimal. To trigger user
visible change, a program has to do the following steps:

 - create memfd:
   - with `MFD_NOEXEC_SEAL`,
   - without `MFD_ALLOW_SEALING`;
 - try to add seals / check the seals.

But that seems unlikely to happen intentionally since this
change essentially reverts the kernel's behaviour to that of
Linux <6.3, so if a program worked correctly on those older
kernels, it will likely work correctly after this change.

I have used Debian Code Search and GitHub to try to find potential
breakages, and I could only find a single one. dbus-broker's
memfd_create() wrapper is aware of this implicit `MFD_ALLOW_SEALING`
behaviour, and tries to work around it[2]. This workaround will
break. Luckily, this only affects the test suite, it does not affect
the normal operations of dbus-broker. There is a PR with a fix[3].

There was also a previous attempt to address
this peculiarity by introducing a new flag[4].

[0]: https://lore.kernel.org/lkml/20220805222126.142525-3-jeffxu@google.com/
[1]: https://lore.kernel.org/lkml/20221202013404.163143-3-jeffxu@google.com/
[2]: https://github.com/bus1/dbus-broker/blob/9eb0b7e5826fc76cad7b025bc46f267d4a8784cb/src/util/misc.c#L114
[3]: https://github.com/bus1/dbus-broker/pull/366
[4]: https://lore.kernel.org/lkml/20230714114753.170814-1-david@readahead.eu/

Cc: stable@vger.kernel.org
Fixes: 105ff5339f498a ("mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC")
Signed-off-by: Barnabás Pőcze <pobrn@protonmail.com>
Reviewed-by: Jeff Xu <jeffxu@google.com>
Reviewed-by: David Rheinsberg <david@readahead.eu>

---
David Rheinsberg May 24, 2024, 2:28 p.m. UTC | #12
Hi

On Thu, May 23, 2024, at 6:55 PM, Jeff Xu wrote:
> On Thu, May 23, 2024 at 9:20 AM Jeff Xu <jeffxu@google.com> wrote:
>> On Thu, May 23, 2024 at 1:24 AM David Rheinsberg <david@readahead.eu> wrote:
>> > We asked for exactly this fix before, so I very much support this. Our test-suite in `dbus-broker` merely verifies what the current kernel behavior is (just like the kernel selftests). I am certainly ok if the kernel breaks it. I will gladly adapt the test-suite.
>> >
> memfd is by default not sealable, and file is by default sealable,
> right ? that makes the memfd  semantics different from other objects
> in linux.
> I wonder what is the original reason to have memfd  this way?

shmem-files are *not* sealable by default. This design was followed for backward compatibility reasons, since shmem-files predate sealing and silently enabling sealing on all shmem-files would have broken existing users (see shmem.c which initializes seals to F_SEAL_SEAL).

I am not sure what you mean with "makes [memfd] semantics different from other objects in linux". Can you elaborate?

Since `memfd_create` was introduced at the same time as shmem-sealing, it could certainly have enabled sealing by default. Not sure whether this would be preferable, though.

> Another solution is to change memfd to be by-default sealable,
> although that will be an api change, but what side effect  will it be
> ?
> If we are worried about the memfd being sealed by an attacker, the
> malicious code could also overwrite the content since memfd is not
> sealed.

You cannot change the default-seals retrospectively. There are existing shmem-users that share file-descriptors and *expect* the other party to be able to override data, but do *not* expect the other party to be able to apply seals. Note that these models explicitly *want* shared, writable access to the buffer (e.g., render-client shares a buffer with the display server for scanout), so just because you can *write* to a shmem-file does not mean that sharing is unsafe (e.g., using SIGBUS+mmap can safely deal with page-faults).

Thanks
David
Aleksa Sarai May 24, 2024, 4:12 p.m. UTC | #13
On 2024-05-23, Jeff Xu <jeffxu@google.com> wrote:
> On Thu, May 23, 2024 at 1:24 AM David Rheinsberg <david@readahead.eu> wrote:
> >
> > Hi
> >
> > On Thu, May 23, 2024, at 4:25 AM, Barnabás Pőcze wrote:
> > > 2024. május 23., csütörtök 1:23 keltezéssel, Andrew Morton
> > > <akpm@linux-foundation.org> írta:
> > >> It's a change to a userspace API, yes?  Please let's have a detailed
> > >> description of why this is OK.  Why it won't affect any existing users.
> > >
> > > Yes, it is a uAPI change. To trigger user visible change, a program has to
> > >
> > >  - create a memfd
> > >    - with MFD_NOEXEC_SEAL,
> > >    - without MFD_ALLOW_SEALING;
> > >  - try to add seals / check the seals.
> > >
> > > This change in essence reverts the kernel's behaviour to that of Linux
> > > <6.3, where
> > > only `MFD_ALLOW_SEALING` enabled sealing. If a program works correctly
> > > on those
> > > kernels, it will likely work correctly after this change.
> > >
> > > I have looked through Debian Code Search and GitHub, searching for
> > > `MFD_NOEXEC_SEAL`.
> > > And I could find only a single breakage that this change would case:
> > > dbus-broker
> > > has its own memfd_create() wrapper that is aware of this implicit
> > > `MFD_ALLOW_SEALING`
> > > behaviour[0], and tries to work around it. This workaround will break.
> > > Luckily,
> > > however, as far as I could tell this only affects the test suite of
> > > dbus-broker,
> > > not its normal operations, so I believe it should be fine. I have
> > > prepared a PR
> > > with a fix[1].
> >
> > We asked for exactly this fix before, so I very much support this. Our test-suite in `dbus-broker` merely verifies what the current kernel behavior is (just like the kernel selftests). I am certainly ok if the kernel breaks it. I will gladly adapt the test-suite.
> >
> > Previous discussion was in:
> >
> >     [PATCH] memfd: support MFD_NOEXEC alongside MFD_EXEC
> >     https://lore.kernel.org/lkml/20230714114753.170814-1-david@readahead.eu/
> >
> > Note that this fix is particularly important in combination with `vm.memfd_noexec=2`, since this breaks existing user-space by enabling sealing on all memfds unconditionally. I also encourage backporting to stable kernels.
> >
> Also with vm.memfd_noexec=1.
> I think that problem must be addressed either with this patch, or with
> a new flag.
> 
> Regarding vm.memfd_noexec, on another topic.
> I think in addition to  vm.memfd_noexec = 1 and 2,  there still could
> be another state: 3
> 
> =0. Do nothing.
> =1. This will add MFD_NOEXEC_SEAL if application didn't set EXEC or
> MFD_NOEXE_SEAL (to help with the migration)
> =2: This will reject all calls without MFD_NOEXEC_SEAL (the whole
> system doesn't allow executable memfd)
> =3:  Application must set MFD_EXEC or MFD_NOEXEC_SEAL explicitly, or
> else it will be rejected.
> 
> 3 is useful because it lets applications choose what to use, and
> forces applications to migrate to new semantics (this is what 2 did
> before 9876cfe8).
> The caveat is 3 is less restrictive than 2, so must document it clearly.

As discussed at the time, "you must use this flag" is not a useful
setting for a general purpose operating system because it explicitly
disables backwards compatibility (breaking any application that was
written in the past 10 years!) for no reason other than "new is better".

As I suggested when we fixed the semantics of vm.memfd_noexec, if you
really want to block a particular flag from not being set, seccomp lets
you do this incredibly easily without acting as a footgun for admins.
Yes, vm.memfd_noexec can break programs that use executable memfds, but
that is the point of the sysctl -- making vm.memfd_noexec break programs
that don't use executable memfds (they are only guilty of being written
before mid-2023) is not useful.

In addition, making 3 less restrictive than 2 would make the original
restriction mechanism useless. A malicious process could raise the
setting to 3 and disable the "protection" (as discussed before, I really
don't understand the threat model here, but making it possible to
disable easily is pretty clearly). You could change the policy, but now
you're adding more complexity for a feature that IMO doesn't really make
sense in the first place.

> -Jeff
> 
> > Reviewed-by: David Rheinsberg <david@readahead.eu>
> >
> > Thanks
> > David
Jeff Xu May 28, 2024, 5:13 p.m. UTC | #14
On Fri, May 24, 2024 at 7:29 AM David Rheinsberg <david@readahead.eu> wrote:
>
> Hi
>
> On Thu, May 23, 2024, at 6:55 PM, Jeff Xu wrote:
> > On Thu, May 23, 2024 at 9:20 AM Jeff Xu <jeffxu@google.com> wrote:
> >> On Thu, May 23, 2024 at 1:24 AM David Rheinsberg <david@readahead.eu> wrote:
> >> > We asked for exactly this fix before, so I very much support this. Our test-suite in `dbus-broker` merely verifies what the current kernel behavior is (just like the kernel selftests). I am certainly ok if the kernel breaks it. I will gladly adapt the test-suite.
> >> >
> > memfd is by default not sealable, and file is by default sealable,
> > right ? that makes the memfd  semantics different from other objects
> > in linux.
> > I wonder what is the original reason to have memfd  this way?
>
> shmem-files are *not* sealable by default. This design was followed for backward compatibility reasons, since shmem-files predate sealing and silently enabling sealing on all shmem-files would have broken existing users (see shmem.c which initializes seals to F_SEAL_SEAL).
>
One may ask the question: If shmem-files  need to be non-sealable by
default, does memfd need to be so as well?

> I am not sure what you mean with "makes [memfd] semantics different from other objects in linux". Can you elaborate?
>
The memory sealing feature - mseal() went through similar discussion
on MAP_SEALABLE flag during the RFC phase,  but everyone doesn't like
the flag, and it gets dropped.
The feedback from communities for MAP_SEALABLE were.
- such a flag will slow down the adoption of the feature, i.e.
applications on multiple layers/libraries must change in order to use
sealing, i.e.  time of construction and  time of sealing might reside
in different libraries.
- Deny of service attack is likely not a concern,  the attacker that
is able to call mseal() can probably already call mprotect() or other
calls and achieve a similar DOS attack.

> Since `memfd_create` was introduced at the same time as shmem-sealing, it could certainly have enabled sealing by default. Not sure whether this would be preferable, though.
>
I would think making memfd sealable is desirable.

Probably the same for a shmem-file too.

> > Another solution is to change memfd to be by-default sealable,
> > although that will be an api change, but what side effect  will it be
> > ?
> > If we are worried about the memfd being sealed by an attacker, the
> > malicious code could also overwrite the content since memfd is not
> > sealed.
>
> You cannot change the default-seals retrospectively. There are existing shmem-users that share file-descriptors and *expect* the other party to be able to override data, but do *not* expect the other party to be able to apply seals. Note that these models explicitly *want* shared, writable access to the buffer (e.g., render-client shares a buffer with the display server for scanout), so just because you can *write* to a shmem-file does not mean that sharing is unsafe (e.g., using SIGBUS+mmap can safely deal with page-faults).
>
If the other party is controlled by an attacker, the attacker can
write garbage to the shm-file/memfd, that is already the end of the
game, at that point, sealing is no longer a concern, right?
If the threat-model is preventing attacker on the other side to write
the garbage data, then F_SEAL_WRITE|F_SEAL_SHRINK|F_SEAL_GROW can be
applied, in that case, default-sealable seems preferable because of
less code change.
If the other party needs to write to shmem/memfd anyway, then maybe
F_SEAL_EXEC needs to be applied ?

Thanks
-Jeff

> Thanks
> David
Jeff Xu May 28, 2024, 5:56 p.m. UTC | #15
Hi Aleksa,

On Fri, May 24, 2024 at 9:12 AM Aleksa Sarai <cyphar@cyphar.com> wrote:
>
> On 2024-05-23, Jeff Xu <jeffxu@google.com> wrote:

> > Regarding vm.memfd_noexec, on another topic.
> > I think in addition to  vm.memfd_noexec = 1 and 2,  there still could
> > be another state: 3
> >
> > =0. Do nothing.
> > =1. This will add MFD_NOEXEC_SEAL if application didn't set EXEC or
> > MFD_NOEXE_SEAL (to help with the migration)
> > =2: This will reject all calls without MFD_NOEXEC_SEAL (the whole
> > system doesn't allow executable memfd)
> > =3:  Application must set MFD_EXEC or MFD_NOEXEC_SEAL explicitly, or
> > else it will be rejected.
> >
> > 3 is useful because it lets applications choose what to use, and
> > forces applications to migrate to new semantics (this is what 2 did
> > before 9876cfe8).
> > The caveat is 3 is less restrictive than 2, so must document it clearly.
>
> As discussed at the time, "you must use this flag" is not a useful
> setting for a general purpose operating system because it explicitly
> disables backwards compatibility (breaking any application that was
> written in the past 10 years!) for no reason other than "new is better".
>
Are you referring to ratcheting in the sysctl in my original patch or
is this something else ?
I do not disagree with your change of  "removing the ratcheting" from
the admin point of view.

> As I suggested when we fixed the semantics of vm.memfd_noexec, if you
> really want to block a particular flag from not being set, seccomp lets
> you do this incredibly easily without acting as a footgun for admins.

seccomp can but it requires more work for the container, e.g.
container needs to allow-list all the syscalls. I'm trying to point
out that seccomp might not cover all user-cases.

"ratcheting" in the vm.memfd_noexec is lightweight  and can be applied
to the sandbox  of the container in advance, but since admin doesn't
like ratcheting in sysctl, maybe prctl or LSM are ways to implement
such restriction.

> Yes, vm.memfd_noexec can break programs that use executable memfds, but
> that is the point of the sysctl -- making vm.memfd_noexec break programs
> that don't use executable memfds (they are only guilty of being written
> before mid-2023) is not useful.
>
> In addition, making 3 less restrictive than 2 would make the original
> restriction mechanism useless. A malicious process could raise the
> setting to 3 and disable the "protection" (as discussed before, I really
> don't understand the threat model here, but making it possible to
> disable easily is pretty clearly).
> You could change the policy, but now
> you're adding more complexity for a feature that IMO doesn't really make
> sense in the first place.
>
The reason of 3 is help with migration (not for threat-model), e.g. a
container can force every apps run in the container migrates their
memfd_create  to use either MFD_EXEC or MFD_NOEXEC_SEAL.
But I understand what you mean, with current code,  adding 3 would
cause more confusion to vm.memfd_noexec. Perhaps a new sysctl or prctl
is the way to go if the app wants to force migration.
In the hinder sight: two sysctls would work betters:  the first deal
with migration, the second enforces NO_EXEC_SEAL.

Thanks
-Jeff


> > -Jeff
> >
> > > Reviewed-by: David Rheinsberg <david@readahead.eu>
> > >
> > > Thanks
> > > David
>
> --
> Aleksa Sarai
> Senior Software Engineer (Containers)
> SUSE Linux GmbH
> <https://www.cyphar.com/>
Aleksa Sarai June 2, 2024, 9:45 a.m. UTC | #16
On 2024-05-28, Jeff Xu <jeffxu@google.com> wrote:
> Hi Aleksa,
> 
> On Fri, May 24, 2024 at 9:12 AM Aleksa Sarai <cyphar@cyphar.com> wrote:
> >
> > On 2024-05-23, Jeff Xu <jeffxu@google.com> wrote:
> 
> > > Regarding vm.memfd_noexec, on another topic.
> > > I think in addition to  vm.memfd_noexec = 1 and 2,  there still could
> > > be another state: 3
> > >
> > > =0. Do nothing.
> > > =1. This will add MFD_NOEXEC_SEAL if application didn't set EXEC or
> > > MFD_NOEXE_SEAL (to help with the migration)
> > > =2: This will reject all calls without MFD_NOEXEC_SEAL (the whole
> > > system doesn't allow executable memfd)
> > > =3:  Application must set MFD_EXEC or MFD_NOEXEC_SEAL explicitly, or
> > > else it will be rejected.
> > >
> > > 3 is useful because it lets applications choose what to use, and
> > > forces applications to migrate to new semantics (this is what 2 did
> > > before 9876cfe8).
> > > The caveat is 3 is less restrictive than 2, so must document it clearly.
> >
> > As discussed at the time, "you must use this flag" is not a useful
> > setting for a general purpose operating system because it explicitly
> > disables backwards compatibility (breaking any application that was
> > written in the past 10 years!) for no reason other than "new is better".
> >
> Are you referring to ratcheting in the sysctl in my original patch or
> is this something else ?
> I do not disagree with your change of  "removing the ratcheting" from
> the admin point of view.

I'm referring to your original patch making vm.memfd_noexec=2 reject
memfd_create() when called without MFD_EXEC or MFD_NOEXEC_SEAL (which is
what every program written pre-mid-2023 did, since those flags didn't
exist yet). This was fixed in 202e14222fad, and is separate to
ratcheting (which was fixed in 9876cfe8ec1c).

We definitely had a long discussion about it at the time.

> > As I suggested when we fixed the semantics of vm.memfd_noexec, if you
> > really want to block a particular flag from not being set, seccomp lets
> > you do this incredibly easily without acting as a footgun for admins.
> 
> seccomp can but it requires more work for the container, e.g.
> container needs to allow-list all the syscalls.

If you're applying the rule for a single syscall you can create a
deny-list, so no need for a full-on filter for everything. Also, most
containers already have allow-list seccomp filters applied, so adjusting
them to add a restriction for one syscall is not that complicated.

> I'm trying to point out that seccomp might not cover all user-cases.

One of the reasons I'm suggesting seccomp because I think that this
"use-case" probably only exists within ChromeOS, and so adding more
kernel infrastructure around it makes little sense. seccomp has
effectively no performance overhead for something this simple and lets
you block this if you really want to.

For general purpose distributions and systems, an administrative knob to
make working programs error out if they don't pass an effectively-noop
flag to memfd_create(2) doesn't help pressure anyone into migrating
because the random unmaintained program using memfd_create(2) isn't
developed by the same company making the distribution...

How would an admin setting vm.memfd_break_random_programs=1 help
random_app_downloaded_from_the_internet get patched to use
MFD_EXEC/MFD_NOEXEC_SEAL? (It doesn't.)

> "ratcheting" in the vm.memfd_noexec is lightweight  and can be applied
> to the sandbox  of the container in advance, but since admin doesn't
> like ratcheting in sysctl, maybe prctl or LSM are ways to implement
> such restriction.
>
> > Yes, vm.memfd_noexec can break programs that use executable memfds, but
> > that is the point of the sysctl -- making vm.memfd_noexec break programs
> > that don't use executable memfds (they are only guilty of being written
> > before mid-2023) is not useful.
> >
> > In addition, making 3 less restrictive than 2 would make the original
> > restriction mechanism useless. A malicious process could raise the
> > setting to 3 and disable the "protection" (as discussed before, I really
> > don't understand the threat model here, but making it possible to
> > disable easily is pretty clearly).
> > You could change the policy, but now
> > you're adding more complexity for a feature that IMO doesn't really make
> > sense in the first place.
> >
> The reason of 3 is help with migration (not for threat-model), e.g. a
> container can force every apps run in the container migrates their
> memfd_create  to use either MFD_EXEC or MFD_NOEXEC_SEAL.

This is the argument you had for the behaviour of vm.memfd_noexec=2 at
the time. In the discussion we had last year, I explained that this is
not "helpful" for the reasons explained above.

There are plenty of old interfaces in Linux and we don't generally push
people to migrate to newer interfaces, especially since in this case
we're talking about a flag that is a no-op for the vast majority of
programs on most systems.

Only programs that need MFD_EXEC actually _need_ to switch (if distros
eventually make vm.memfd_noexec=1 the default in a decade or two) and
the most well-known programs that use MFD_EXEC have already been patched
(runc is probably the most obvious one, and we patched this last year).

Not to mention that a lot of programs that are maintained have already
switched to the mostly-noop MFD_NOEXEC_SEAL, so the only programs that
would be affected by this migration "help" would be older programs that
won't be pressured to update because they're unmaintained.

> But I understand what you mean, with current code,  adding 3 would
> cause more confusion to vm.memfd_noexec. Perhaps a new sysctl or prctl
> is the way to go if the app wants to force migration.
> In the hinder sight: two sysctls would work betters:  the first deal
> with migration, the second enforces NO_EXEC_SEAL.

The "help with migration" feature shouldn't exist, and I removed it in
9876cfe8ec1c for a reason.
David Rheinsberg June 7, 2024, 8:38 a.m. UTC | #17
Hi

On Tue, May 28, 2024, at 7:13 PM, Jeff Xu wrote:
>> > Another solution is to change memfd to be by-default sealable,
>> > although that will be an api change, but what side effect  will it be
>> > ?
>> > If we are worried about the memfd being sealed by an attacker, the
>> > malicious code could also overwrite the content since memfd is not
>> > sealed.
>>
>> You cannot change the default-seals retrospectively. There are existing shmem-users that share file-descriptors and *expect* the other party to be able to override data, but do *not* expect the other party to be able to apply seals. Note that these models explicitly *want* shared, writable access to the buffer (e.g., render-client shares a buffer with the display server for scanout), so just because you can *write* to a shmem-file does not mean that sharing is unsafe (e.g., using SIGBUS+mmap can safely deal with page-faults).
>>
> If the other party is controlled by an attacker, the attacker can
> write garbage to the shm-file/memfd, that is already the end of the
> game, at that point, sealing is no longer a concern, right?

No. If a graphics client shares a buffer with a graphics server, the client is free to write garbage into the buffer. This is not unsafe. The graphics server will display whatever the client writes into the buffer. This is completely safe, without sealing and with a writable buffer.

> If the threat-model is preventing attacker on the other side to write
> the garbage data, then F_SEAL_WRITE|F_SEAL_SHRINK|F_SEAL_GROW can be
> applied, in that case, default-sealable seems preferable because of
> less code change.

Again, the threat-model is *NOT* concerned with writes.

Graphics clients/servers are a good example where *ANY* data is valid and can be processed by the privileged server. Hence, *ANY* writes are allowed and safe. No need for any seals. Those setups existed way before `memfd_create` was added (including seals).

However, when windows are resized, graphic buffers need to be resized as well. In those setups, the graphics server might call `ftruncate(2)`. If you suddenly make shmem-files sealable by default, a client can set `F_SEAL_SHRINK/GROW` and the privileged graphics server will get an error from `ftruncate(2)`, which it might not be able to handle, as it correctly never expected this to happen.

Thanks
David
Jeff Xu June 7, 2024, 3:58 p.m. UTC | #18
Hi David,

On Fri, Jun 7, 2024 at 1:38 AM David Rheinsberg <david@readahead.eu> wrote:
>
> Hi
>
> On Tue, May 28, 2024, at 7:13 PM, Jeff Xu wrote:
> >> > Another solution is to change memfd to be by-default sealable,
> >> > although that will be an api change, but what side effect  will it be
> >> > ?
> >> > If we are worried about the memfd being sealed by an attacker, the
> >> > malicious code could also overwrite the content since memfd is not
> >> > sealed.
> >>
> >> You cannot change the default-seals retrospectively. There are existing shmem-users that share file-descriptors and *expect* the other party to be able to override data, but do *not* expect the other party to be able to apply seals. Note that these models explicitly *want* shared, writable access to the buffer (e.g., render-client shares a buffer with the display server for scanout), so just because you can *write* to a shmem-file does not mean that sharing is unsafe (e.g., using SIGBUS+mmap can safely deal with page-faults).
> >>
> > If the other party is controlled by an attacker, the attacker can
> > write garbage to the shm-file/memfd, that is already the end of the
> > game, at that point, sealing is no longer a concern, right?
>
> No. If a graphics client shares a buffer with a graphics server, the client is free to write garbage into the buffer. This is not unsafe. The graphics server will display whatever the client writes into the buffer. This is completely safe, without sealing and with a writable buffer.
>
> > If the threat-model is preventing attacker on the other side to write
> > the garbage data, then F_SEAL_WRITE|F_SEAL_SHRINK|F_SEAL_GROW can be
> > applied, in that case, default-sealable seems preferable because of
> > less code change.
>
> Again, the threat-model is *NOT* concerned with writes.
>
> Graphics clients/servers are a good example where *ANY* data is valid and can be processed by the privileged server. Hence, *ANY* writes are allowed and safe. No need for any seals. Those setups existed way before `memfd_create` was added (including seals).
>
> However, when windows are resized, graphic buffers need to be resized as well. In those setups, the graphics server might call `ftruncate(2)`. If you suddenly make shmem-files sealable by default, a client can set `F_SEAL_SHRINK/GROW` and the privileged graphics server will get an error from `ftruncate(2)`, which it might not be able to handle, as it correctly never expected this to happen.
>

The graphic buffer  is a good example for shmem-files of
not-sealable-by-default. Thanks for the details.

-Jeff


> Thanks
> David
diff mbox series

Patch

diff --git a/mm/memfd.c b/mm/memfd.c
index 7d8d3ab3fa37..8b7f6afee21d 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -356,12 +356,11 @@  SYSCALL_DEFINE2(memfd_create,
 
 		inode->i_mode &= ~0111;
 		file_seals = memfd_file_seals_ptr(file);
-		if (file_seals) {
-			*file_seals &= ~F_SEAL_SEAL;
+		if (file_seals)
 			*file_seals |= F_SEAL_EXEC;
-		}
-	} else if (flags & MFD_ALLOW_SEALING) {
-		/* MFD_EXEC and MFD_ALLOW_SEALING are set */
+	}
+
+	if (flags & MFD_ALLOW_SEALING) {
 		file_seals = memfd_file_seals_ptr(file);
 		if (file_seals)
 			*file_seals &= ~F_SEAL_SEAL;
diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c
index 18f585684e20..b6a7ad68c3c1 100644
--- a/tools/testing/selftests/memfd/memfd_test.c
+++ b/tools/testing/selftests/memfd/memfd_test.c
@@ -1151,7 +1151,7 @@  static void test_noexec_seal(void)
 			    mfd_def_size,
 			    MFD_CLOEXEC | MFD_NOEXEC_SEAL);
 	mfd_assert_mode(fd, 0666);
-	mfd_assert_has_seals(fd, F_SEAL_EXEC);
+	mfd_assert_has_seals(fd, F_SEAL_SEAL | F_SEAL_EXEC);
 	mfd_fail_chmod(fd, 0777);
 	close(fd);
 }