mbox series

[0/4] aarch64: avoid mprotect(PROT_BTI|PROT_EXEC) [BZ #26831]

Message ID cover.1604393169.git.szabolcs.nagy@arm.com (mailing list archive)
Headers show
Series aarch64: avoid mprotect(PROT_BTI|PROT_EXEC) [BZ #26831] | expand

Message

Szabolcs Nagy Nov. 3, 2020, 10:25 a.m. UTC
Re-mmap executable segments instead of mprotecting them in
case mprotect is seccomp filtered.

For the kernel mapped main executable we don't have the fd
for re-mmap so linux needs to be updated to add BTI. (In the
presence of seccomp filters for mprotect(PROT_EXEC) the libc
cannot change BTI protection at runtime based on user space
policy so it is better if the kernel maps BTI compatible
binaries with PROT_BTI by default.)

Szabolcs Nagy (4):
  elf: Pass the fd to note processing [BZ #26831]
  elf: Move note processing after l_phdr is updated [BZ #26831]
  aarch64: Use mmap to add PROT_BTI instead of mprotect [BZ #26831]
  aarch64: Remove the bti link_map field [BZ #26831]

 elf/dl-load.c              | 38 ++++++++++++++++---------------
 elf/rtld.c                 |  4 ++--
 sysdeps/aarch64/dl-bti.c   | 46 ++++++++++++++++++++------------------
 sysdeps/aarch64/dl-prop.h  | 17 +++++++-------
 sysdeps/aarch64/linkmap.h  |  1 -
 sysdeps/generic/dl-prop.h  |  6 ++---
 sysdeps/generic/ldsodefs.h |  5 +++--
 sysdeps/x86/dl-prop.h      |  6 ++---
 8 files changed, 64 insertions(+), 59 deletions(-)

Comments

Mark Brown Nov. 3, 2020, 5:34 p.m. UTC | #1
On Tue, Nov 03, 2020 at 10:25:37AM +0000, Szabolcs Nagy wrote:

> Re-mmap executable segments instead of mprotecting them in
> case mprotect is seccomp filtered.

> For the kernel mapped main executable we don't have the fd
> for re-mmap so linux needs to be updated to add BTI. (In the
> presence of seccomp filters for mprotect(PROT_EXEC) the libc
> cannot change BTI protection at runtime based on user space
> policy so it is better if the kernel maps BTI compatible
> binaries with PROT_BTI by default.)

Given that there were still some ongoing discussions on a more robust
kernel interface here and there seem to be a few concerns with this
series should we perhaps just take a step back and disable this seccomp
filter in systemd on arm64, at least for the time being?  That seems
safer than rolling out things that set ABI quickly, a big part of the
reason we went with having the dynamic linker enable PROT_BTI in the
first place was to give us more flexibility to handle any unforseen
consequences of enabling BTI that we run into.  We are going to have
similar issues with other features like MTE so we need to make sure that
whatever we're doing works with them too.

Also updated to Will's current e-mail address - Will, do you have
thoughts on what we should do here?
Jeremy Linton Nov. 4, 2020, 5:41 a.m. UTC | #2
Hi,

On 11/3/20 11:34 AM, Mark Brown wrote:
> On Tue, Nov 03, 2020 at 10:25:37AM +0000, Szabolcs Nagy wrote:
> 
>> Re-mmap executable segments instead of mprotecting them in
>> case mprotect is seccomp filtered.
> 
>> For the kernel mapped main executable we don't have the fd
>> for re-mmap so linux needs to be updated to add BTI. (In the
>> presence of seccomp filters for mprotect(PROT_EXEC) the libc
>> cannot change BTI protection at runtime based on user space
>> policy so it is better if the kernel maps BTI compatible
>> binaries with PROT_BTI by default.)
> 
> Given that there were still some ongoing discussions on a more robust
> kernel interface here and there seem to be a few concerns with this
> series should we perhaps just take a step back and disable this seccomp
> filter in systemd on arm64, at least for the time being?  That seems
> safer than rolling out things that set ABI quickly, a big part of the

So, that's a bigger hammer than I think is needed and punishes !BTI 
machines. I'm going to suggest that if we need to carry a temp patch its 
more like the glibc patch I mentioned in the Fedora defect. That patch 
simply logs a message, on the mprotect failures rather than aborting. 
Its fairly non-intrusive.

That leaves seccomp functional, and BTI generally functional except when 
seccomp is restricting it. I've also been asked that if a patch like 
that is needed, its (temporary?) merged to the glibc trunk, rather than 
just being carried by the distro's.

Thanks,

> reason we went with having the dynamic linker enable PROT_BTI in the
> first place was to give us more flexibility to handle any unforseen
> consequences of enabling BTI that we run into.  We are going to have
> similar issues with other features like MTE so we need to make sure that
> whatever we're doing works with them too.
> 
> Also updated to Will's current e-mail address - Will, do you have
> thoughts on what we should do here?
>
Szabolcs Nagy Nov. 4, 2020, 8:57 a.m. UTC | #3
The 11/03/2020 23:41, Jeremy Linton wrote:
> On 11/3/20 11:34 AM, Mark Brown wrote:
> > On Tue, Nov 03, 2020 at 10:25:37AM +0000, Szabolcs Nagy wrote:
> > 
> > > Re-mmap executable segments instead of mprotecting them in
> > > case mprotect is seccomp filtered.
> > 
> > > For the kernel mapped main executable we don't have the fd
> > > for re-mmap so linux needs to be updated to add BTI. (In the
> > > presence of seccomp filters for mprotect(PROT_EXEC) the libc
> > > cannot change BTI protection at runtime based on user space
> > > policy so it is better if the kernel maps BTI compatible
> > > binaries with PROT_BTI by default.)
> > 
> > Given that there were still some ongoing discussions on a more robust
> > kernel interface here and there seem to be a few concerns with this
> > series should we perhaps just take a step back and disable this seccomp
> > filter in systemd on arm64, at least for the time being?  That seems
> > safer than rolling out things that set ABI quickly, a big part of the
> 
> So, that's a bigger hammer than I think is needed and punishes !BTI
> machines. I'm going to suggest that if we need to carry a temp patch its
> more like the glibc patch I mentioned in the Fedora defect. That patch
> simply logs a message, on the mprotect failures rather than aborting. Its
> fairly non-intrusive.
> 
> That leaves seccomp functional, and BTI generally functional except when
> seccomp is restricting it. I've also been asked that if a patch like that is
> needed, its (temporary?) merged to the glibc trunk, rather than just being
> carried by the distro's.

note that changing mprotect into mmap in glibc works
even if the kernel or systemd decides to do things
differently: currently the only wart is that on the
main exe we have to use mprotect and silently ignore
the failures. (but e.g. some return to libc attacks
are reliably prevented since libc.so guaranteed to
have PROT_BTI this way.)

the patchset is a bit more intrusive than i hoped
for, so if we expect to get a new syscall then
simply ignoring mprotect failures may be a better
temporary solution. (and i still need to do
benchmarks to see if the cost of re-mmap is not
much more than the cost of mprotect.)

changing the seccomp filter in systemd does not
help if there are other seccomp filters elsewhere
doing the same. i think we will have to avoid using
mprotect(PROT_EXEC) or at least ignore the failure.

> > reason we went with having the dynamic linker enable PROT_BTI in the
> > first place was to give us more flexibility to handle any unforseen
> > consequences of enabling BTI that we run into.  We are going to have
> > similar issues with other features like MTE so we need to make sure that
> > whatever we're doing works with them too.
> > 
> > Also updated to Will's current e-mail address - Will, do you have
> > thoughts on what we should do here?
> >
Topi Miettinen Nov. 4, 2020, 9:02 a.m. UTC | #4
On 3.11.2020 19.34, Mark Brown wrote:
> On Tue, Nov 03, 2020 at 10:25:37AM +0000, Szabolcs Nagy wrote:
> 
>> Re-mmap executable segments instead of mprotecting them in
>> case mprotect is seccomp filtered.
> 
>> For the kernel mapped main executable we don't have the fd
>> for re-mmap so linux needs to be updated to add BTI. (In the
>> presence of seccomp filters for mprotect(PROT_EXEC) the libc
>> cannot change BTI protection at runtime based on user space
>> policy so it is better if the kernel maps BTI compatible
>> binaries with PROT_BTI by default.)
> 
> Given that there were still some ongoing discussions on a more robust
> kernel interface here and there seem to be a few concerns with this
> series should we perhaps just take a step back and disable this seccomp
> filter in systemd on arm64, at least for the time being?

Filtering mprotect() and mmap() with seccomp also protects BTI, since 
without it the attacker could remove PROT_BTI from existing pages, or 
map new pages without BTI. This would be possible even with SARA or 
SELinux execmem protections enabled, since they don't care about PROT_BTI.

-Topi
Will Deacon Nov. 4, 2020, 9:20 a.m. UTC | #5
On Tue, Nov 03, 2020 at 05:34:38PM +0000, Mark Brown wrote:
> On Tue, Nov 03, 2020 at 10:25:37AM +0000, Szabolcs Nagy wrote:
> 
> > Re-mmap executable segments instead of mprotecting them in
> > case mprotect is seccomp filtered.
> 
> > For the kernel mapped main executable we don't have the fd
> > for re-mmap so linux needs to be updated to add BTI. (In the
> > presence of seccomp filters for mprotect(PROT_EXEC) the libc
> > cannot change BTI protection at runtime based on user space
> > policy so it is better if the kernel maps BTI compatible
> > binaries with PROT_BTI by default.)
> 
> Given that there were still some ongoing discussions on a more robust
> kernel interface here and there seem to be a few concerns with this
> series should we perhaps just take a step back and disable this seccomp
> filter in systemd on arm64, at least for the time being?  That seems
> safer than rolling out things that set ABI quickly, a big part of the
> reason we went with having the dynamic linker enable PROT_BTI in the
> first place was to give us more flexibility to handle any unforseen
> consequences of enabling BTI that we run into.  We are going to have
> similar issues with other features like MTE so we need to make sure that
> whatever we're doing works with them too.
> 
> Also updated to Will's current e-mail address - Will, do you have
> thoughts on what we should do here?

Changing the kernel to map the main executable with PROT_BTI by default is a
user-visible change in behaviour and not without risk, so if we're going to
do that then it needs to be opt-in because the current behaviour has been
there since 5.8. I suppose we could shoe-horn in a cmdline option for 5.10
(which will be the first LTS with BTI) but it would be better to put up with
the current ABI if possible.

Is there real value in this seccomp filter if it only looks at mprotect(),
or was it just implemented because it's easy to do and sounds like a good
idea?

Will
Florian Weimer Nov. 4, 2020, 9:29 a.m. UTC | #6
* Will Deacon:

> Is there real value in this seccomp filter if it only looks at mprotect(),
> or was it just implemented because it's easy to do and sounds like a good
> idea?

It seems bogus to me.  Everyone will just create alias mappings instead,
just like they did for the similar SELinux feature.  See “Example code
to avoid execmem violations” in:

  <https://www.akkadia.org/drepper/selinux-mem.html>

As you can see, this reference implementation creates a PROT_WRITE
mapping aliased to a PROT_EXEC mapping, so it actually reduces security
compared to something that generates the code in an anonymous mapping
and calls mprotect to make it executable.

Furthermore, it requires unusual cache flushing code on some AArch64
implementations (a requirement that is not shared by any Linux other
architecture to which libffi has been ported), resulting in
hard-to-track-down real-world bugs.

Thanks,
Florian
Topi Miettinen Nov. 4, 2020, 9:55 a.m. UTC | #7
On 4.11.2020 11.29, Florian Weimer wrote:
> * Will Deacon:
> 
>> Is there real value in this seccomp filter if it only looks at mprotect(),
>> or was it just implemented because it's easy to do and sounds like a good
>> idea?
> 
> It seems bogus to me.  Everyone will just create alias mappings instead,
> just like they did for the similar SELinux feature.  See “Example code
> to avoid execmem violations” in:
> 
>    <https://www.akkadia.org/drepper/selinux-mem.html>

Also note "But this is very dangerous: programs should never use memory 
regions which are writable and executable at the same time. Assuming 
that it is really necessary to generate executable code while the 
program runs the method employed should be reconsidered."

> As you can see, this reference implementation creates a PROT_WRITE
> mapping aliased to a PROT_EXEC mapping, so it actually reduces security
> compared to something that generates the code in an anonymous mapping
> and calls mprotect to make it executable.

Drepper's methods to avoid SELinux protections are indeed the two ways 
(which I'm aware) to also avoid the seccomp filter: by using 
memfd_create() and using a file system which writable and executable to 
the process to create a new executable file. Both methods can be 
eliminated for many system services, memfd_create() with seccomp and 
filesystem W&X with mount namespaces.

If a service legitimately needs executable and writable mappings (due to 
JIT, trampolines etc), it's easy to disable the filter whenever really 
needed with "MemoryDenyWriteExecute=no" (which is the default) in case 
of systemd or a TE rule like "allow type_t self:process { execmem };" 
for SELinux. But this shouldn't be the default case, since there are 
many services which don't need W&X.

I'd also question what is the value of BTI if it can be easily 
circumvented by removing PROT_BTI with mprotect()?

-Topi
Mark Brown Nov. 4, 2020, 10:50 a.m. UTC | #8
On Tue, Nov 03, 2020 at 11:41:42PM -0600, Jeremy Linton wrote:
> On 11/3/20 11:34 AM, Mark Brown wrote:

> > Given that there were still some ongoing discussions on a more robust
> > kernel interface here and there seem to be a few concerns with this
> > series should we perhaps just take a step back and disable this seccomp
> > filter in systemd on arm64, at least for the time being?  That seems
> > safer than rolling out things that set ABI quickly, a big part of the

> So, that's a bigger hammer than I think is needed and punishes !BTI
> machines. I'm going to suggest that if we need to carry a temp patch its
> more like the glibc patch I mentioned in the Fedora defect. That patch
> simply logs a message, on the mprotect failures rather than aborting. Its
> fairly non-intrusive.

> That leaves seccomp functional, and BTI generally functional except when
> seccomp is restricting it. I've also been asked that if a patch like that is
> needed, its (temporary?) merged to the glibc trunk, rather than just being
> carried by the distro's.

The effect on pre-BTI hardware is an issue, another option would be for
systemd to disable this seccomp usage but only after checking for BTI
support in the system rather than just doing so purely based on the
architecture.
Catalin Marinas Nov. 4, 2020, 2:35 p.m. UTC | #9
On Wed, Nov 04, 2020 at 11:55:57AM +0200, Topi Miettinen wrote:
> On 4.11.2020 11.29, Florian Weimer wrote:
> > * Will Deacon:
> > 
> > > Is there real value in this seccomp filter if it only looks at mprotect(),
> > > or was it just implemented because it's easy to do and sounds like a good
> > > idea?
> > 
> > It seems bogus to me.  Everyone will just create alias mappings instead,
> > just like they did for the similar SELinux feature.  See “Example code
> > to avoid execmem violations” in:
> > 
> >    <https://www.akkadia.org/drepper/selinux-mem.html>
[...]
> > As you can see, this reference implementation creates a PROT_WRITE
> > mapping aliased to a PROT_EXEC mapping, so it actually reduces security
> > compared to something that generates the code in an anonymous mapping
> > and calls mprotect to make it executable.
[...]
> If a service legitimately needs executable and writable mappings (due to
> JIT, trampolines etc), it's easy to disable the filter whenever really
> needed with "MemoryDenyWriteExecute=no" (which is the default) in case of
> systemd or a TE rule like "allow type_t self:process { execmem };" for
> SELinux. But this shouldn't be the default case, since there are many
> services which don't need W&X.

I think Drepper's point is that separate X and W mappings, with enough
randomisation, would be more secure than allowing W&X at the same
address (but, of course, less secure than not having W at all, though
that's not always possible).

> I'd also question what is the value of BTI if it can be easily circumvented
> by removing PROT_BTI with mprotect()?

Well, BTI is a protection against JOP attacks. The assumption here is
that an attacker cannot invoke mprotect() to disable PROT_BTI. If it
can, it's probably not worth bothering with a subsequent JOP attack, it
can already call functions directly.

I see MDWX not as a way of detecting attacks but rather plugging
inadvertent security holes in certain programs. On arm64, such hardening
currently gets in the way of another hardening feature, BTI.
Catalin Marinas Nov. 4, 2020, 2:41 p.m. UTC | #10
On Wed, Nov 04, 2020 at 08:57:05AM +0000, Szabolcs Nagy wrote:
> The 11/03/2020 23:41, Jeremy Linton wrote:
> > On 11/3/20 11:34 AM, Mark Brown wrote:
> > > On Tue, Nov 03, 2020 at 10:25:37AM +0000, Szabolcs Nagy wrote:
> > > 
> > > > Re-mmap executable segments instead of mprotecting them in
> > > > case mprotect is seccomp filtered.
> > > 
> > > > For the kernel mapped main executable we don't have the fd
> > > > for re-mmap so linux needs to be updated to add BTI. (In the
> > > > presence of seccomp filters for mprotect(PROT_EXEC) the libc
> > > > cannot change BTI protection at runtime based on user space
> > > > policy so it is better if the kernel maps BTI compatible
> > > > binaries with PROT_BTI by default.)
> > > 
> > > Given that there were still some ongoing discussions on a more robust
> > > kernel interface here and there seem to be a few concerns with this
> > > series should we perhaps just take a step back and disable this seccomp
> > > filter in systemd on arm64, at least for the time being?  That seems
> > > safer than rolling out things that set ABI quickly, a big part of the
> > 
> > So, that's a bigger hammer than I think is needed and punishes !BTI
> > machines. I'm going to suggest that if we need to carry a temp patch its
> > more like the glibc patch I mentioned in the Fedora defect. That patch
> > simply logs a message, on the mprotect failures rather than aborting. Its
> > fairly non-intrusive.
> > 
> > That leaves seccomp functional, and BTI generally functional except when
> > seccomp is restricting it. I've also been asked that if a patch like that is
> > needed, its (temporary?) merged to the glibc trunk, rather than just being
> > carried by the distro's.
> 
> note that changing mprotect into mmap in glibc works
> even if the kernel or systemd decides to do things
> differently: currently the only wart is that on the
> main exe we have to use mprotect and silently ignore
> the failures.

Can the dynamic loader mmap() the main exe again while munmap'ing the
original one? (sorry if it was already discussed)
Florian Weimer Nov. 4, 2020, 2:45 p.m. UTC | #11
* Catalin Marinas:

> Can the dynamic loader mmap() the main exe again while munmap'ing the
> original one? (sorry if it was already discussed)

No, we don't have a descriptor for that.  /proc may not be mounted, and
using the path stored there has a race condition anyway.

Thanks,
Florian
Topi Miettinen Nov. 4, 2020, 3:19 p.m. UTC | #12
On 4.11.2020 16.35, Catalin Marinas wrote:
> On Wed, Nov 04, 2020 at 11:55:57AM +0200, Topi Miettinen wrote:
>> On 4.11.2020 11.29, Florian Weimer wrote:
>>> * Will Deacon:
>>>
>>>> Is there real value in this seccomp filter if it only looks at mprotect(),
>>>> or was it just implemented because it's easy to do and sounds like a good
>>>> idea?
>>>
>>> It seems bogus to me.  Everyone will just create alias mappings instead,
>>> just like they did for the similar SELinux feature.  See “Example code
>>> to avoid execmem violations” in:
>>>
>>>     <https://www.akkadia.org/drepper/selinux-mem.html>
> [...]
>>> As you can see, this reference implementation creates a PROT_WRITE
>>> mapping aliased to a PROT_EXEC mapping, so it actually reduces security
>>> compared to something that generates the code in an anonymous mapping
>>> and calls mprotect to make it executable.
> [...]
>> If a service legitimately needs executable and writable mappings (due to
>> JIT, trampolines etc), it's easy to disable the filter whenever really
>> needed with "MemoryDenyWriteExecute=no" (which is the default) in case of
>> systemd or a TE rule like "allow type_t self:process { execmem };" for
>> SELinux. But this shouldn't be the default case, since there are many
>> services which don't need W&X.
> 
> I think Drepper's point is that separate X and W mappings, with enough
> randomisation, would be more secure than allowing W&X at the same
> address (but, of course, less secure than not having W at all, though
> that's not always possible).
> 
>> I'd also question what is the value of BTI if it can be easily circumvented
>> by removing PROT_BTI with mprotect()?
> 
> Well, BTI is a protection against JOP attacks. The assumption here is
> that an attacker cannot invoke mprotect() to disable PROT_BTI. If it
> can, it's probably not worth bothering with a subsequent JOP attack, it
> can already call functions directly.

I suppose that the target for the attacker is to eventually perform 
system calls rather than looping forever in JOP/ROP gadgets.

> I see MDWX not as a way of detecting attacks but rather plugging
> inadvertent security holes in certain programs. On arm64, such hardening
> currently gets in the way of another hardening feature, BTI.

I don't think it has to get in the way at all. Why wouldn't something 
simple like this work:

diff --git a/elf/dl-load.c b/elf/dl-load.c
index 646c5dca40..12a74d15e8 100644
--- a/elf/dl-load.c
+++ b/elf/dl-load.c
@@ -1170,8 +1170,13 @@ _dl_map_object_from_fd (const char *name, const 
char *origname, int fd,
             c->prot |= PROT_READ;
           if (ph->p_flags & PF_W)
             c->prot |= PROT_WRITE;
-         if (ph->p_flags & PF_X)
+         if (ph->p_flags & PF_X) {
             c->prot |= PROT_EXEC;
+#ifdef PROT_BTI
+           if (GLRO(dl_bti) & 1)
+             c->prot |= PROT_BTI;
+#endif
+         }
  #endif
           break;

diff --git a/elf/dl-support.c b/elf/dl-support.c
index 7704c101c5..22c7cc7b81 100644
--- a/elf/dl-support.c
+++ b/elf/dl-support.c
@@ -222,7 +222,7 @@ __rtld_lock_define_initialized_recursive (, 
_dl_load_write_lock)


  #ifdef HAVE_AUX_VECTOR
-int _dl_clktck;
+int _dl_clktck, _dl_bti;

  void
  _dl_aux_init (ElfW(auxv_t) *av)
@@ -294,6 +294,11 @@ _dl_aux_init (ElfW(auxv_t) *av)
        case AT_RANDOM:
         _dl_random = (void *) av->a_un.a_val;
         break;
+#ifdef PROT_BTI
+      case AT_BTI:
+       _dl_bti = (void *) av->a_un.a_val;
+       break;
+#endif
        DL_PLATFORM_AUXV
        }
    if (seen == 0xf)

Kernel sets the aux vector to indicate that BTI should be enabled for 
all segments and main exe is already protected.

-Topi
Mark Rutland Nov. 4, 2020, 3:20 p.m. UTC | #13
On Wed, Nov 04, 2020 at 11:55:57AM +0200, Topi Miettinen wrote:
> On 4.11.2020 11.29, Florian Weimer wrote:
> > * Will Deacon:
> > 
> > > Is there real value in this seccomp filter if it only looks at mprotect(),
> > > or was it just implemented because it's easy to do and sounds like a good
> > > idea?
> > 
> > It seems bogus to me.  Everyone will just create alias mappings instead,
> > just like they did for the similar SELinux feature.  See “Example code
> > to avoid execmem violations” in:
> > 
> >    <https://www.akkadia.org/drepper/selinux-mem.html>
> 
> Also note "But this is very dangerous: programs should never use memory
> regions which are writable and executable at the same time. Assuming that it
> is really necessary to generate executable code while the program runs the
> method employed should be reconsidered."

Sure, and to be clear we're not trying to violate the "at the same time"
property. We do not want to permit simultaneous PROT_WRITE and PROT_EXEC
at any instant in time. What we're asking is to not block changing
permissions to PROT_EXEC in the absence of PROT_WRITE.

I think that the goal of preventing WRITE -> EXEC transitions for some
memory is sane, but I think the existing kernel primitives available to
systemd don't allow us to do that in a robust way because we don't have
all the relevant state tracked and accessible, and the existing approach
gets in the way of doing the right thing for other mitigations.

Consequently I think it would be better going forward to add a more
robust (kernel) mechanism for enforcement that can distinguish
WRITE->EXEC from EXEC->EXEC+BTI, and e.g. can be used to forbid aliasing
mappings with differing W/X permissions. Then userspace could eventually
transition over to that and get /stronger/ protection while permitting
the BTI case we'd like to work now.

> If a service legitimately needs executable and writable mappings (due to
> JIT, trampolines etc), it's easy to disable the filter whenever really
> needed with "MemoryDenyWriteExecute=no" (which is the default) in case of
> systemd or a TE rule like "allow type_t self:process { execmem };" for
> SELinux. But this shouldn't be the default case, since there are many
> services which don't need W&X.
> 
> I'd also question what is the value of BTI if it can be easily circumvented
> by removing PROT_BTI with mprotect()?

I agree that turning BTI off is a concern, and to that end I'd like to
add an enforcement mechanism whereby we could prevent that (ideally the
same mechanism by which we could prevent WRITE -> EXEC transitions). 

But, as with all things it's a matter of degree. MDWE and BTI are both
hurdles to an adversary, but neither are absolutes and there are
approaches to bypass either. By the time someone's issuing mprotect()
with an arbitrary VA and/or prot, they are liable to have been able to
do the same with mmap() and circumvent MDWE.

I'd really like to not have BTI silently disabled in order to work with
MDWE, because the risk is that it gets silently disabled elsewhere. The
risk of the changing the kernel to enable BTI for a binary is not well
known since we don't control other peoples libraries that might end up
not being compatible somehow with that. The risk of disabling a portion
of the MDWE protections seems to be the least out of the options we have
available, as unfortunate as it seems, and I think we can come up with a
better MDWE approach going forward.

Thanks,
Mark.
Szabolcs Nagy Nov. 4, 2020, 4:08 p.m. UTC | #14
The 11/04/2020 17:19, Topi Miettinen wrote:
> On 4.11.2020 16.35, Catalin Marinas wrote:
> > On Wed, Nov 04, 2020 at 11:55:57AM +0200, Topi Miettinen wrote:
> > > On 4.11.2020 11.29, Florian Weimer wrote:
> > > > * Will Deacon:
> > > > > Is there real value in this seccomp filter if it only looks at mprotect(),
> > > > > or was it just implemented because it's easy to do and sounds like a good
> > > > > idea?
> > > > 
> > > > It seems bogus to me.  Everyone will just create alias mappings instead,
> > > > just like they did for the similar SELinux feature.  See “Example code
> > > > to avoid execmem violations” in:
> > > > 
> > > >     <https://www.akkadia.org/drepper/selinux-mem.html>
> > [...]
> > > > As you can see, this reference implementation creates a PROT_WRITE
> > > > mapping aliased to a PROT_EXEC mapping, so it actually reduces security
> > > > compared to something that generates the code in an anonymous mapping
> > > > and calls mprotect to make it executable.
> > [...]
> > > If a service legitimately needs executable and writable mappings (due to
> > > JIT, trampolines etc), it's easy to disable the filter whenever really
> > > needed with "MemoryDenyWriteExecute=no" (which is the default) in case of
> > > systemd or a TE rule like "allow type_t self:process { execmem };" for
> > > SELinux. But this shouldn't be the default case, since there are many
> > > services which don't need W&X.
> > 
> > I think Drepper's point is that separate X and W mappings, with enough
> > randomisation, would be more secure than allowing W&X at the same
> > address (but, of course, less secure than not having W at all, though
> > that's not always possible).
> > 
> > > I'd also question what is the value of BTI if it can be easily circumvented
> > > by removing PROT_BTI with mprotect()?
> > 
> > Well, BTI is a protection against JOP attacks. The assumption here is
> > that an attacker cannot invoke mprotect() to disable PROT_BTI. If it
> > can, it's probably not worth bothering with a subsequent JOP attack, it
> > can already call functions directly.
> 
> I suppose that the target for the attacker is to eventually perform system
> calls rather than looping forever in JOP/ROP gadgets.
> 
> > I see MDWX not as a way of detecting attacks but rather plugging
> > inadvertent security holes in certain programs. On arm64, such hardening
> > currently gets in the way of another hardening feature, BTI.
> 
> I don't think it has to get in the way at all. Why wouldn't something simple
> like this work:

PROT_BTI is only valid on binaries that are BTI compatible.
to detect that, the load segments must be already mapped.

AT_BTI does not solve this: we want to be able to load legacy
elf modules. (a BTI enforcement setting may be useful where
incompatible modules are rejected, but that cannot be the
default for backward compatibility reasons.)

> 
> diff --git a/elf/dl-load.c b/elf/dl-load.c
> index 646c5dca40..12a74d15e8 100644
> --- a/elf/dl-load.c
> +++ b/elf/dl-load.c
> @@ -1170,8 +1170,13 @@ _dl_map_object_from_fd (const char *name, const char
> *origname, int fd,
>             c->prot |= PROT_READ;
>           if (ph->p_flags & PF_W)
>             c->prot |= PROT_WRITE;
> -         if (ph->p_flags & PF_X)
> +         if (ph->p_flags & PF_X) {
>             c->prot |= PROT_EXEC;
> +#ifdef PROT_BTI
> +           if (GLRO(dl_bti) & 1)
> +             c->prot |= PROT_BTI;
> +#endif
> +         }
>  #endif
>           break;
> 
> diff --git a/elf/dl-support.c b/elf/dl-support.c
> index 7704c101c5..22c7cc7b81 100644
> --- a/elf/dl-support.c
> +++ b/elf/dl-support.c
> @@ -222,7 +222,7 @@ __rtld_lock_define_initialized_recursive (,
> _dl_load_write_lock)
> 
> 
>  #ifdef HAVE_AUX_VECTOR
> -int _dl_clktck;
> +int _dl_clktck, _dl_bti;
> 
>  void
>  _dl_aux_init (ElfW(auxv_t) *av)
> @@ -294,6 +294,11 @@ _dl_aux_init (ElfW(auxv_t) *av)
>        case AT_RANDOM:
>         _dl_random = (void *) av->a_un.a_val;
>         break;
> +#ifdef PROT_BTI
> +      case AT_BTI:
> +       _dl_bti = (void *) av->a_un.a_val;
> +       break;
> +#endif
>        DL_PLATFORM_AUXV
>        }
>    if (seen == 0xf)
> 
> Kernel sets the aux vector to indicate that BTI should be enabled for all
> segments and main exe is already protected.
> 
> -Topi
Jeremy Linton Nov. 4, 2020, 6:47 p.m. UTC | #15
Hi,

On 11/4/20 4:50 AM, Mark Brown wrote:
> On Tue, Nov 03, 2020 at 11:41:42PM -0600, Jeremy Linton wrote:
>> On 11/3/20 11:34 AM, Mark Brown wrote:
> 
>>> Given that there were still some ongoing discussions on a more robust
>>> kernel interface here and there seem to be a few concerns with this
>>> series should we perhaps just take a step back and disable this seccomp
>>> filter in systemd on arm64, at least for the time being?  That seems
>>> safer than rolling out things that set ABI quickly, a big part of the
> 
>> So, that's a bigger hammer than I think is needed and punishes !BTI
>> machines. I'm going to suggest that if we need to carry a temp patch its
>> more like the glibc patch I mentioned in the Fedora defect. That patch
>> simply logs a message, on the mprotect failures rather than aborting. Its
>> fairly non-intrusive.
> 
>> That leaves seccomp functional, and BTI generally functional except when
>> seccomp is restricting it. I've also been asked that if a patch like that is
>> needed, its (temporary?) merged to the glibc trunk, rather than just being
>> carried by the distro's.
> 
> The effect on pre-BTI hardware is an issue, another option would be for
> systemd to disable this seccomp usage but only after checking for BTI
> support in the system rather than just doing so purely based on the
> architecture.
> 

That works, but your also losing seccomp in the case where the machine 
is BTI capable, but the service isn't. So it should really be checking 
the elf notes, but at that point you might just as well patch glibc.
Mark Brown Nov. 4, 2020, 6:53 p.m. UTC | #16
On Wed, Nov 04, 2020 at 12:47:09PM -0600, Jeremy Linton wrote:
> On 11/4/20 4:50 AM, Mark Brown wrote:

> > The effect on pre-BTI hardware is an issue, another option would be for
> > systemd to disable this seccomp usage but only after checking for BTI
> > support in the system rather than just doing so purely based on the
> > architecture.

> That works, but your also losing seccomp in the case where the machine is
> BTI capable, but the service isn't. So it should really be checking the elf
> notes, but at that point you might just as well patch glibc.

True, I guess I was assuming that a BTI rebuild is done at the distro
level but of course even if that's the case a system could have third
party binaries so you can't just assume that the world is BTI.
Jeremy Linton Nov. 4, 2020, 6:59 p.m. UTC | #17
Hi,

On 11/4/20 9:20 AM, Mark Rutland wrote:
> On Wed, Nov 04, 2020 at 11:55:57AM +0200, Topi Miettinen wrote:
>> On 4.11.2020 11.29, Florian Weimer wrote:
>>> * Will Deacon:
>>>
>>>> Is there real value in this seccomp filter if it only looks at mprotect(),
>>>> or was it just implemented because it's easy to do and sounds like a good
>>>> idea?
>>>
>>> It seems bogus to me.  Everyone will just create alias mappings instead,
>>> just like they did for the similar SELinux feature.  See “Example code
>>> to avoid execmem violations” in:
>>>
>>>     <https://www.akkadia.org/drepper/selinux-mem.html>
>>
>> Also note "But this is very dangerous: programs should never use memory
>> regions which are writable and executable at the same time. Assuming that it
>> is really necessary to generate executable code while the program runs the
>> method employed should be reconsidered."
> 
> Sure, and to be clear we're not trying to violate the "at the same time"
> property. We do not want to permit simultaneous PROT_WRITE and PROT_EXEC
> at any instant in time. What we're asking is to not block changing
> permissions to PROT_EXEC in the absence of PROT_WRITE.
> 
> I think that the goal of preventing WRITE -> EXEC transitions for some
> memory is sane, but I think the existing kernel primitives available to
> systemd don't allow us to do that in a robust way because we don't have
> all the relevant state tracked and accessible, and the existing approach
> gets in the way of doing the right thing for other mitigations.
> 
> Consequently I think it would be better going forward to add a more
> robust (kernel) mechanism for enforcement that can distinguish
> WRITE->EXEC from EXEC->EXEC+BTI, and e.g. can be used to forbid aliasing
> mappings with differing W/X permissions. Then userspace could eventually
> transition over to that and get /stronger/ protection while permitting
> the BTI case we'd like to work now.
> 
>> If a service legitimately needs executable and writable mappings (due to
>> JIT, trampolines etc), it's easy to disable the filter whenever really
>> needed with "MemoryDenyWriteExecute=no" (which is the default) in case of
>> systemd or a TE rule like "allow type_t self:process { execmem };" for
>> SELinux. But this shouldn't be the default case, since there are many
>> services which don't need W&X.
>>
>> I'd also question what is the value of BTI if it can be easily circumvented
>> by removing PROT_BTI with mprotect()?
> 
> I agree that turning BTI off is a concern, and to that end I'd like to
> add an enforcement mechanism whereby we could prevent that (ideally the
> same mechanism by which we could prevent WRITE -> EXEC transitions).
> 
> But, as with all things it's a matter of degree. MDWE and BTI are both
> hurdles to an adversary, but neither are absolutes and there are
> approaches to bypass either. By the time someone's issuing mprotect()
> with an arbitrary VA and/or prot, they are liable to have been able to
> do the same with mmap() and circumvent MDWE.
> 
> I'd really like to not have BTI silently disabled in order to work with
> MDWE, because the risk is that it gets silently disabled elsewhere. The
> risk of the changing the kernel to enable BTI for a binary is not well
> known since we don't control other peoples libraries that might end up
> not being compatible somehow with that. The risk of disabling a portion
> of the MDWE protections seems to be the least out of the options we have
> available, as unfortunate as it seems, and I think we can come up with a
> better MDWE approach going forward.

OTOH, You don't really want to blanket disable either protection, and 
unfortunately  you can't really tell until its too late if the service 
is fully BTI enabled. So you either end up disabling MDWE unnecessarily, 
or you delay until the only choice is not enabling BTI.

I guess there is another option too, which is some kind of delayed MDWE 
policy that only turns on once the service has started, but that isn't 
ideal either.

.

> 
> Thanks,
> Mark.
>
Szabolcs Nagy Nov. 5, 2020, 11:31 a.m. UTC | #18
The 11/04/2020 09:20, Will Deacon wrote:
> On Tue, Nov 03, 2020 at 05:34:38PM +0000, Mark Brown wrote:
> > On Tue, Nov 03, 2020 at 10:25:37AM +0000, Szabolcs Nagy wrote:
> > 
> > > Re-mmap executable segments instead of mprotecting them in
> > > case mprotect is seccomp filtered.
> > 
> > > For the kernel mapped main executable we don't have the fd
> > > for re-mmap so linux needs to be updated to add BTI. (In the
> > > presence of seccomp filters for mprotect(PROT_EXEC) the libc
> > > cannot change BTI protection at runtime based on user space
> > > policy so it is better if the kernel maps BTI compatible
> > > binaries with PROT_BTI by default.)
> > 
> > Given that there were still some ongoing discussions on a more robust
> > kernel interface here and there seem to be a few concerns with this
> > series should we perhaps just take a step back and disable this seccomp
> > filter in systemd on arm64, at least for the time being?  That seems
> > safer than rolling out things that set ABI quickly, a big part of the
> > reason we went with having the dynamic linker enable PROT_BTI in the
> > first place was to give us more flexibility to handle any unforseen
> > consequences of enabling BTI that we run into.  We are going to have
> > similar issues with other features like MTE so we need to make sure that
> > whatever we're doing works with them too.
> > 
> > Also updated to Will's current e-mail address - Will, do you have
> > thoughts on what we should do here?
> 
> Changing the kernel to map the main executable with PROT_BTI by default is a
> user-visible change in behaviour and not without risk, so if we're going to
> do that then it needs to be opt-in because the current behaviour has been
> there since 5.8. I suppose we could shoe-horn in a cmdline option for 5.10
> (which will be the first LTS with BTI) but it would be better to put up with
> the current ABI if possible.

it's not clear to me how adding PROT_BTI in
the kernel would be observable in practice.

adding PROT_BTI to marked elf modules should
only have effect in cases when the process does
invalid operations and then there is no compat
requirement. if this is not the case then adding
PROT_BTI on static exe is already problematic.

if there is some issue with bti that makes
users want to turn it off, then they should do
it system wide or may be we can have a no-bti
option in userspace which uses mprotect to turn
it off (but since that has to be an explicit
opt-out i don't mind if the user also has to
disable the seccomp sandbox).

> Is there real value in this seccomp filter if it only looks at mprotect(),
> or was it just implemented because it's easy to do and sounds like a good
> idea?

i'm fine with just using mprotect and telling
users to remove the seccomp filter. but that
makes bti less attractive for deployment.