mm/mmap: Map MAP_STACK to VM_STACK

Message ID	20230418210230.3495922-1-longman@redhat.com (mailing list archive)
State	New
Headers	show Return-Path: <owner-linux-mm@kvack.org> From: Waiman Long <longman@redhat.com> To: Andrew Morton <akpm@linux-foundation.org> Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Joe Mario <jmario@redhat.com>, Barry Marson <bmarson@redhat.com>, Rafael Aquini <aquini@redhat.com>, Waiman Long <longman@redhat.com> Subject: [PATCH] mm/mmap: Map MAP_STACK to VM_STACK Date: Tue, 18 Apr 2023 17:02:30 -0400 Message-Id: <20230418210230.3495922-1-longman@redhat.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	mm/mmap: Map MAP_STACK to VM_STACK \| expand mm/mmap: Map MAP_STACK to VM_STACK

Waiman Long April 18, 2023, 9:02 p.m. UTC

One of the flags of mmap(2) is MAP_STACK to request a memory segment
suitable for a process or thread stack. The kernel currently ignores
this flags. Glibc uses MAP_STACK when mmapping a thread stack. However,
selinux has an execstack check in selinux_file_mprotect() which disallows
a stack VMA to be made executable.

Since MAP_STACK is a noop, it is possible for a stack VMA to be merged
with an adjacent anonymous VMA. With that merging, using mprotect(2)
to change a part of the merged anonymous VMA to make it executable may
fail. This can lead to sporadic failure of applications that need to
make those changes.

One possible fix is to make sure that a stack VMA will not be merged
with a non-stack anonymous VMA. That requires a vm flag that can be
used to distinguish a stack VMA from a regular anonymous VMA. One
can add a new dummy vm flag for that purpose. However, there is only
1 bit left in the lower 32 bits of vm_flags. Another alternative is
to use an existing vm flag. VM_STACK (= VM_GROWSDOWN for most arches)
can certainly be used for this purpose. The downside is that it is a
slight change in existing behavior.

Making a stack VMA growable by default certainly fits the need of a
process or thread stack. This patch now maps MAP_STACK to VM_STACK to
prevent unwanted merging with adjacent non-stack VMAs and make the VMA
more suitable for being used as a stack.

Reported-by: Joe Mario <jmario@redhat.com>
Signed-off-by: Waiman Long <longman@redhat.com>
---
 include/linux/mman.h | 1 +
 1 file changed, 1 insertion(+)

Andrew Morton April 18, 2023, 9:18 p.m. UTC | #1

On Tue, 18 Apr 2023 17:02:30 -0400 Waiman Long <longman@redhat.com> wrote:

> One of the flags of mmap(2) is MAP_STACK to request a memory segment
> suitable for a process or thread stack. The kernel currently ignores
> this flags. Glibc uses MAP_STACK when mmapping a thread stack. However,
> selinux has an execstack check in selinux_file_mprotect() which disallows
> a stack VMA to be made executable.
> 
> Since MAP_STACK is a noop, it is possible for a stack VMA to be merged
> with an adjacent anonymous VMA. With that merging, using mprotect(2)
> to change a part of the merged anonymous VMA to make it executable may
> fail. This can lead to sporadic failure of applications that need to
> make those changes.

"Sporadic failure of applications" sounds quite serious.  Can you
provide more details?

Did you consider a -stable backport?  I'm unable to judge, because the
description of the userspace effects is so thin,

> One possible fix is to make sure that a stack VMA will not be merged
> with a non-stack anonymous VMA. That requires a vm flag that can be
> used to distinguish a stack VMA from a regular anonymous VMA. One
> can add a new dummy vm flag for that purpose. However, there is only
> 1 bit left in the lower 32 bits of vm_flags. Another alternative is
> to use an existing vm flag. VM_STACK (= VM_GROWSDOWN for most arches)
> can certainly be used for this purpose. The downside is that it is a
> slight change in existing behavior.
> 
> Making a stack VMA growable by default certainly fits the need of a
> process or thread stack. This patch now maps MAP_STACK to VM_STACK to
> prevent unwanted merging with adjacent non-stack VMAs and make the VMA
> more suitable for being used as a stack.
> 
> ...
>
> --- a/include/linux/mman.h
> +++ b/include/linux/mman.h
> @@ -152,6 +152,7 @@ calc_vm_flag_bits(unsigned long flags)
>  	return _calc_vm_trans(flags, MAP_GROWSDOWN,  VM_GROWSDOWN ) |
>  	       _calc_vm_trans(flags, MAP_LOCKED,     VM_LOCKED    ) |
>  	       _calc_vm_trans(flags, MAP_SYNC,	     VM_SYNC      ) |
> +	       _calc_vm_trans(flags, MAP_STACK,	     VM_STACK     ) |
>  	       arch_calc_vm_flag_bits(flags);
>  }

The mmap(2) manpage says

  This flag is currently a no-op on Linux.  However, by employing
  this flag, applications can ensure that they transparently ob- tain
  support if the flag is implemented in the future.  Thus, it is used
  in the glibc threading implementation to allow for the fact that some
  architectures may (later) require special treat- ment for stack
  allocations.  A further reason to employ this flag is portability:
  MAP_STACK exists (and has an effect) on some other systems (e.g.,
  some of the BSDs).

so please propose an update for this?

Waiman Long April 19, 2023, 1:16 a.m. UTC | #2

On 4/18/23 17:18, Andrew Morton wrote:
> On Tue, 18 Apr 2023 17:02:30 -0400 Waiman Long <longman@redhat.com> wrote:
>
>> One of the flags of mmap(2) is MAP_STACK to request a memory segment
>> suitable for a process or thread stack. The kernel currently ignores
>> this flags. Glibc uses MAP_STACK when mmapping a thread stack. However,
>> selinux has an execstack check in selinux_file_mprotect() which disallows
>> a stack VMA to be made executable.
>>
>> Since MAP_STACK is a noop, it is possible for a stack VMA to be merged
>> with an adjacent anonymous VMA. With that merging, using mprotect(2)
>> to change a part of the merged anonymous VMA to make it executable may
>> fail. This can lead to sporadic failure of applications that need to
>> make those changes.
> "Sporadic failure of applications" sounds quite serious.  Can you
> provide more details?

The problem boils down to the fact that it is possible for user code to 
mmap a region of memory and then for the kernel to merge the VMA for 
that memory with the VMA for one of the application's thread stacks. 
This is causing random SEGVs with one of our large customer application.

At a high level, this is what's happening:

  1) App runs creating lots of threads.
  2) It mmap's 256K pages of anonymous memory.
  3) It writes executable code to that memory.
  4) It calls mprotect() with PROT_EXEC on that memory so
     it can subsequently execute the code.

The above mprotect() will fail if the mmap'd region's VMA gets merged 
with the VMA for one of the thread stacks.  That's because the default 
RHEL SELinux policy is to not allow executable stacks.

>
> Did you consider a -stable backport?  I'm unable to judge, because the
> description of the userspace effects is so thin,

Yes, stable backport can be considered.


>
>> One possible fix is to make sure that a stack VMA will not be merged
>> with a non-stack anonymous VMA. That requires a vm flag that can be
>> used to distinguish a stack VMA from a regular anonymous VMA. One
>> can add a new dummy vm flag for that purpose. However, there is only
>> 1 bit left in the lower 32 bits of vm_flags. Another alternative is
>> to use an existing vm flag. VM_STACK (= VM_GROWSDOWN for most arches)
>> can certainly be used for this purpose. The downside is that it is a
>> slight change in existing behavior.
>>
>> Making a stack VMA growable by default certainly fits the need of a
>> process or thread stack. This patch now maps MAP_STACK to VM_STACK to
>> prevent unwanted merging with adjacent non-stack VMAs and make the VMA
>> more suitable for being used as a stack.
>>
>> ...
>>
>> --- a/include/linux/mman.h
>> +++ b/include/linux/mman.h
>> @@ -152,6 +152,7 @@ calc_vm_flag_bits(unsigned long flags)
>>   	return _calc_vm_trans(flags, MAP_GROWSDOWN,  VM_GROWSDOWN ) |
>>   	       _calc_vm_trans(flags, MAP_LOCKED,     VM_LOCKED    ) |
>>   	       _calc_vm_trans(flags, MAP_SYNC,	     VM_SYNC      ) |
>> +	       _calc_vm_trans(flags, MAP_STACK,	     VM_STACK     ) |
>>   	       arch_calc_vm_flag_bits(flags);
>>   }
> The mmap(2) manpage says
>
>    This flag is currently a no-op on Linux.  However, by employing
>    this flag, applications can ensure that they transparently ob- tain
>    support if the flag is implemented in the future.  Thus, it is used
>    in the glibc threading implementation to allow for the fact that some
>    architectures may (later) require special treat- ment for stack
>    allocations.  A further reason to employ this flag is portability:
>    MAP_STACK exists (and has an effect) on some other systems (e.g.,
>    some of the BSDs).
>
> so please propose an update for this?

OK, will do.

Thanks,
Longman

Hugh Dickins April 19, 2023, 1:36 a.m. UTC | #3

On Tue, 18 Apr 2023, Waiman Long wrote:
> On 4/18/23 17:18, Andrew Morton wrote:
> > On Tue, 18 Apr 2023 17:02:30 -0400 Waiman Long <longman@redhat.com> wrote:
> >
> >> One of the flags of mmap(2) is MAP_STACK to request a memory segment
> >> suitable for a process or thread stack. The kernel currently ignores
> >> this flags. Glibc uses MAP_STACK when mmapping a thread stack. However,
> >> selinux has an execstack check in selinux_file_mprotect() which disallows
> >> a stack VMA to be made executable.
> >>
> >> Since MAP_STACK is a noop, it is possible for a stack VMA to be merged
> >> with an adjacent anonymous VMA. With that merging, using mprotect(2)
> >> to change a part of the merged anonymous VMA to make it executable may
> >> fail. This can lead to sporadic failure of applications that need to
> >> make those changes.
> > "Sporadic failure of applications" sounds quite serious.  Can you
> > provide more details?
> 
> The problem boils down to the fact that it is possible for user code to mmap a
> region of memory and then for the kernel to merge the VMA for that memory with
> the VMA for one of the application's thread stacks. This is causing random
> SEGVs with one of our large customer application.
> 
> At a high level, this is what's happening:
> 
>  1) App runs creating lots of threads.
>  2) It mmap's 256K pages of anonymous memory.
>  3) It writes executable code to that memory.
>  4) It calls mprotect() with PROT_EXEC on that memory so
>     it can subsequently execute the code.
> 
> The above mprotect() will fail if the mmap'd region's VMA gets merged with the
> VMA for one of the thread stacks.  That's because the default RHEL SELinux
> policy is to not allow executable stacks.

Then wouldn't the bug be at the SELinux end?  VMAs may have been merged
already, but the mprotect() with PROT_EXEC of the good non-stack range
will then split that area off from the stack again - maybe the SELinux
check does not understand that must happen?

Hugh

Waiman Long April 19, 2023, 1:45 a.m. UTC | #4

On 4/18/23 21:36, Hugh Dickins wrote:
> On Tue, 18 Apr 2023, Waiman Long wrote:
>> On 4/18/23 17:18, Andrew Morton wrote:
>>> On Tue, 18 Apr 2023 17:02:30 -0400 Waiman Long <longman@redhat.com> wrote:
>>>
>>>> One of the flags of mmap(2) is MAP_STACK to request a memory segment
>>>> suitable for a process or thread stack. The kernel currently ignores
>>>> this flags. Glibc uses MAP_STACK when mmapping a thread stack. However,
>>>> selinux has an execstack check in selinux_file_mprotect() which disallows
>>>> a stack VMA to be made executable.
>>>>
>>>> Since MAP_STACK is a noop, it is possible for a stack VMA to be merged
>>>> with an adjacent anonymous VMA. With that merging, using mprotect(2)
>>>> to change a part of the merged anonymous VMA to make it executable may
>>>> fail. This can lead to sporadic failure of applications that need to
>>>> make those changes.
>>> "Sporadic failure of applications" sounds quite serious.  Can you
>>> provide more details?
>> The problem boils down to the fact that it is possible for user code to mmap a
>> region of memory and then for the kernel to merge the VMA for that memory with
>> the VMA for one of the application's thread stacks. This is causing random
>> SEGVs with one of our large customer application.
>>
>> At a high level, this is what's happening:
>>
>>   1) App runs creating lots of threads.
>>   2) It mmap's 256K pages of anonymous memory.
>>   3) It writes executable code to that memory.
>>   4) It calls mprotect() with PROT_EXEC on that memory so
>>      it can subsequently execute the code.
>>
>> The above mprotect() will fail if the mmap'd region's VMA gets merged with the
>> VMA for one of the thread stacks.  That's because the default RHEL SELinux
>> policy is to not allow executable stacks.
> Then wouldn't the bug be at the SELinux end?  VMAs may have been merged
> already, but the mprotect() with PROT_EXEC of the good non-stack range
> will then split that area off from the stack again - maybe the SELinux
> check does not understand that must happen?

The SELinux check is done per VMA, not a region within a VMA. After VMA 
merging, SELinux is probably not able to determine which part of a VMA 
is a stack unless we keep that information somewhere and provide an API 
for SELinux to query. That can be quite a lot of work. So the easiest 
way to prevent this problem is to avoid merging a stack VMA with a 
regular anonymous VMA.

Cheers,
Longman

Matthew Wilcox April 19, 2023, 3:24 a.m. UTC | #5

On Tue, Apr 18, 2023 at 09:45:34PM -0400, Waiman Long wrote:
> 
> On 4/18/23 21:36, Hugh Dickins wrote:
> > On Tue, 18 Apr 2023, Waiman Long wrote:
> > > On 4/18/23 17:18, Andrew Morton wrote:
> > > > On Tue, 18 Apr 2023 17:02:30 -0400 Waiman Long <longman@redhat.com> wrote:
> > > > 
> > > > > One of the flags of mmap(2) is MAP_STACK to request a memory segment
> > > > > suitable for a process or thread stack. The kernel currently ignores
> > > > > this flags. Glibc uses MAP_STACK when mmapping a thread stack. However,
> > > > > selinux has an execstack check in selinux_file_mprotect() which disallows
> > > > > a stack VMA to be made executable.
> > > > > 
> > > > > Since MAP_STACK is a noop, it is possible for a stack VMA to be merged
> > > > > with an adjacent anonymous VMA. With that merging, using mprotect(2)
> > > > > to change a part of the merged anonymous VMA to make it executable may
> > > > > fail. This can lead to sporadic failure of applications that need to
> > > > > make those changes.
> > > > "Sporadic failure of applications" sounds quite serious.  Can you
> > > > provide more details?
> > > The problem boils down to the fact that it is possible for user code to mmap a
> > > region of memory and then for the kernel to merge the VMA for that memory with
> > > the VMA for one of the application's thread stacks. This is causing random
> > > SEGVs with one of our large customer application.
> > > 
> > > At a high level, this is what's happening:
> > > 
> > >   1) App runs creating lots of threads.
> > >   2) It mmap's 256K pages of anonymous memory.
> > >   3) It writes executable code to that memory.
> > >   4) It calls mprotect() with PROT_EXEC on that memory so
> > >      it can subsequently execute the code.
> > > 
> > > The above mprotect() will fail if the mmap'd region's VMA gets merged with the
> > > VMA for one of the thread stacks.  That's because the default RHEL SELinux
> > > policy is to not allow executable stacks.
> > Then wouldn't the bug be at the SELinux end?  VMAs may have been merged
> > already, but the mprotect() with PROT_EXEC of the good non-stack range
> > will then split that area off from the stack again - maybe the SELinux
> > check does not understand that must happen?
> 
> The SELinux check is done per VMA, not a region within a VMA. After VMA
> merging, SELinux is probably not able to determine which part of a VMA is a
> stack unless we keep that information somewhere and provide an API for
> SELinux to query. That can be quite a lot of work. So the easiest way to
> prevent this problem is to avoid merging a stack VMA with a regular
> anonymous VMA.

To paraphrase you, "Yes, SELinux is buggy, but we don't want to fix it".

Cc'ing the SELinux people so it can be fixed properly.

Matthew Wilcox April 19, 2023, 3:46 a.m. UTC | #6

On Tue, Apr 18, 2023 at 09:16:37PM -0400, Waiman Long wrote:
>  1) App runs creating lots of threads.
>  2) It mmap's 256K pages of anonymous memory.
>  3) It writes executable code to that memory.
>  4) It calls mprotect() with PROT_EXEC on that memory so
>     it can subsequently execute the code.
> 
> The above mprotect() will fail if the mmap'd region's VMA gets merged with
> the VMA for one of the thread stacks.  That's because the default RHEL
> SELinux policy is to not allow executable stacks.

By the way, this is a daft policy.  The policy you really want is
EXEC|WRITE is not allowed.  A non-writable stack is useless, so it's
actually a superset of your current policy.  Forbidding _simultaneous_
write and executable is just good programming.  This way, you don't need
to care about the underlying VMA's current permissions, you just need
to do:

	if ((prot & (PROT_EXEC|PROT_WRITE)) == (PROT_EXEC|PROT_WRITE))
		return -EACCESS;

Paul Moore April 19, 2023, 2:38 p.m. UTC | #7

On Tue, Apr 18, 2023 at 11:24 PM Matthew Wilcox <willy@infradead.org> wrote:
> On Tue, Apr 18, 2023 at 09:45:34PM -0400, Waiman Long wrote:
> > On 4/18/23 21:36, Hugh Dickins wrote:
> > > On Tue, 18 Apr 2023, Waiman Long wrote:
> > > > On 4/18/23 17:18, Andrew Morton wrote:
> > > > > On Tue, 18 Apr 2023 17:02:30 -0400 Waiman Long <longman@redhat.com> wrote:
> > > > >
> > > > > > One of the flags of mmap(2) is MAP_STACK to request a memory segment
> > > > > > suitable for a process or thread stack. The kernel currently ignores
> > > > > > this flags. Glibc uses MAP_STACK when mmapping a thread stack. However,
> > > > > > selinux has an execstack check in selinux_file_mprotect() which disallows
> > > > > > a stack VMA to be made executable.
> > > > > >
> > > > > > Since MAP_STACK is a noop, it is possible for a stack VMA to be merged
> > > > > > with an adjacent anonymous VMA. With that merging, using mprotect(2)
> > > > > > to change a part of the merged anonymous VMA to make it executable may
> > > > > > fail. This can lead to sporadic failure of applications that need to
> > > > > > make those changes.
> > > > > "Sporadic failure of applications" sounds quite serious.  Can you
> > > > > provide more details?
> > > > The problem boils down to the fact that it is possible for user code to mmap a
> > > > region of memory and then for the kernel to merge the VMA for that memory with
> > > > the VMA for one of the application's thread stacks. This is causing random
> > > > SEGVs with one of our large customer application.
> > > >
> > > > At a high level, this is what's happening:
> > > >
> > > >   1) App runs creating lots of threads.
> > > >   2) It mmap's 256K pages of anonymous memory.
> > > >   3) It writes executable code to that memory.
> > > >   4) It calls mprotect() with PROT_EXEC on that memory so
> > > >      it can subsequently execute the code.
> > > >
> > > > The above mprotect() will fail if the mmap'd region's VMA gets merged with the
> > > > VMA for one of the thread stacks.  That's because the default RHEL SELinux
> > > > policy is to not allow executable stacks.
> > > Then wouldn't the bug be at the SELinux end?  VMAs may have been merged
> > > already, but the mprotect() with PROT_EXEC of the good non-stack range
> > > will then split that area off from the stack again - maybe the SELinux
> > > check does not understand that must happen?
> >
> > The SELinux check is done per VMA, not a region within a VMA. After VMA
> > merging, SELinux is probably not able to determine which part of a VMA is a
> > stack unless we keep that information somewhere and provide an API for
> > SELinux to query. That can be quite a lot of work. So the easiest way to
> > prevent this problem is to avoid merging a stack VMA with a regular
> > anonymous VMA.
>
> To paraphrase you, "Yes, SELinux is buggy, but we don't want to fix it".
>
> Cc'ing the SELinux people so it can be fixed properly.

SELinux needs some way to determine what memory region is currently
being used by an application's stacks.  The current logic can be found
in selinux_file_mprotect(), the relevant snippet is below:

int selinux_file_mprotect(struct vm_area_struct *vma, ...)
{
  ...
  } else if (!vma->vm_file &&
    ((vma->vm_start <= vma->vm_mm->start_stack &&
      vma->vm_end >= vma->vm_mm->start_stack) ||
    vma_is_stack_for_current(vma))) {
      rc = avc_has_perm(&selinux_state,
                        sid, sid, SECCLASS_PROCESS,
                        PROCESS__EXECSTACK, NULL);
 }
  ...
}

If someone has a better, more foolproof way to determine an
application's stack please let us know, or better yet submit a patch
:)

Waiman Long April 19, 2023, 3:07 p.m. UTC | #8

On 4/18/23 23:46, Matthew Wilcox wrote:
> On Tue, Apr 18, 2023 at 09:16:37PM -0400, Waiman Long wrote:
>>   1) App runs creating lots of threads.
>>   2) It mmap's 256K pages of anonymous memory.
>>   3) It writes executable code to that memory.
>>   4) It calls mprotect() with PROT_EXEC on that memory so
>>      it can subsequently execute the code.
>>
>> The above mprotect() will fail if the mmap'd region's VMA gets merged with
>> the VMA for one of the thread stacks.  That's because the default RHEL
>> SELinux policy is to not allow executable stacks.
> By the way, this is a daft policy.  The policy you really want is
> EXEC|WRITE is not allowed.  A non-writable stack is useless, so it's
> actually a superset of your current policy.  Forbidding _simultaneous_
> write and executable is just good programming.  This way, you don't need
> to care about the underlying VMA's current permissions, you just need
> to do:
>
> 	if ((prot & (PROT_EXEC|PROT_WRITE)) == (PROT_EXEC|PROT_WRITE))
> 		return -EACCESS;

I am not totally sure if the application changes the VMA to read-only 
first. Even if it does that, it highlights another possible issue when 
an anonymous VMA is merged with a stack VMA. Either the mprotect() to 
write-protect the VMA will fail or the application will segfault if it 
writes stuff to the stack. This particular issue is not related to 
SELinux. It provides another good idea why we should avoid merging stack 
VMA to anonymous VMA.

Cheers,
Longman

Matthew Wilcox April 19, 2023, 3:09 p.m. UTC | #9

On Wed, Apr 19, 2023 at 11:07:04AM -0400, Waiman Long wrote:
> On 4/18/23 23:46, Matthew Wilcox wrote:
> > On Tue, Apr 18, 2023 at 09:16:37PM -0400, Waiman Long wrote:
> > >   1) App runs creating lots of threads.
> > >   2) It mmap's 256K pages of anonymous memory.
> > >   3) It writes executable code to that memory.
> > >   4) It calls mprotect() with PROT_EXEC on that memory so
> > >      it can subsequently execute the code.
> > > 
> > > The above mprotect() will fail if the mmap'd region's VMA gets merged with
> > > the VMA for one of the thread stacks.  That's because the default RHEL
> > > SELinux policy is to not allow executable stacks.
> > By the way, this is a daft policy.  The policy you really want is
> > EXEC|WRITE is not allowed.  A non-writable stack is useless, so it's
> > actually a superset of your current policy.  Forbidding _simultaneous_
> > write and executable is just good programming.  This way, you don't need
> > to care about the underlying VMA's current permissions, you just need
> > to do:
> > 
> > 	if ((prot & (PROT_EXEC|PROT_WRITE)) == (PROT_EXEC|PROT_WRITE))
> > 		return -EACCESS;
> 
> I am not totally sure if the application changes the VMA to read-only first.
> Even if it does that, it highlights another possible issue when an anonymous
> VMA is merged with a stack VMA. Either the mprotect() to write-protect the
> VMA will fail or the application will segfault if it writes stuff to the
> stack. This particular issue is not related to SELinux. It provides another
> good idea why we should avoid merging stack VMA to anonymous VMA.

mprotect will split the VMA into two VMAs, one that is
PROT_READ|PROT_WRITE and one the is PROT_READ|PROT_EXEC.

Joe Mario April 19, 2023, 4 p.m. UTC | #10

On 4/19/23 11:09 AM, Matthew Wilcox wrote:
> On Wed, Apr 19, 2023 at 11:07:04AM -0400, Waiman Long wrote:
>> On 4/18/23 23:46, Matthew Wilcox wrote:
>>> On Tue, Apr 18, 2023 at 09:16:37PM -0400, Waiman Long wrote:
>>>>   1) App runs creating lots of threads.
>>>>   2) It mmap's 256K pages of anonymous memory.
>>>>   3) It writes executable code to that memory.
>>>>   4) It calls mprotect() with PROT_EXEC on that memory so
>>>>      it can subsequently execute the code.
>>>>
>>>> The above mprotect() will fail if the mmap'd region's VMA gets merged with
>>>> the VMA for one of the thread stacks.  That's because the default RHEL
>>>> SELinux policy is to not allow executable stacks.
>>> By the way, this is a daft policy.  The policy you really want is
>>> EXEC|WRITE is not allowed.  A non-writable stack is useless, so it's
>>> actually a superset of your current policy.  Forbidding _simultaneous_
>>> write and executable is just good programming.  This way, you don't need
>>> to care about the underlying VMA's current permissions, you just need
>>> to do:
>>>
>>> 	if ((prot & (PROT_EXEC|PROT_WRITE)) == (PROT_EXEC|PROT_WRITE))
>>> 		return -EACCESS;
>>
>> I am not totally sure if the application changes the VMA to read-only first.
>> Even if it does that, it highlights another possible issue when an anonymous
>> VMA is merged with a stack VMA. Either the mprotect() to write-protect the
>> VMA will fail or the application will segfault if it writes stuff to the
>> stack. This particular issue is not related to SELinux. It provides another
>> good idea why we should avoid merging stack VMA to anonymous VMA.
> 
> mprotect will split the VMA into two VMAs, one that is
> PROT_READ|PROT_WRITE and one the is PROT_READ|PROT_EXEC.
> 

But in this case, the latter still has PROT_WRITE.  

This was reported by a large data analytics customer.  They started getting infrequent random crashes in code they haven't touched in 10 years.

One of the threads in their program mmaps a large region using PROT_READ|PROT_WRITE, and that region just happens to be merged with the thread's stack.

Then they copy a small snipit of code to a location somewhere within that mapped region. For the one page that contains that code, they mprotect it to PROT_READ|PROT_WRITE|PROT_EXEC.  I recall they're still reading and writing data elsewhere on that page.

Joe

Jane Chu April 19, 2023, 11:21 p.m. UTC | #11

On 4/18/2023 2:18 PM, Andrew Morton wrote:
> On Tue, 18 Apr 2023 17:02:30 -0400 Waiman Long <longman@redhat.com> wrote:
[..]
>> ...
>>
>> --- a/include/linux/mman.h
>> +++ b/include/linux/mman.h
>> @@ -152,6 +152,7 @@ calc_vm_flag_bits(unsigned long flags)
>>   	return _calc_vm_trans(flags, MAP_GROWSDOWN,  VM_GROWSDOWN ) |
>>   	       _calc_vm_trans(flags, MAP_LOCKED,     VM_LOCKED    ) |
>>   	       _calc_vm_trans(flags, MAP_SYNC,	     VM_SYNC      ) |
>> +	       _calc_vm_trans(flags, MAP_STACK,	     VM_STACK     ) |
>>   	       arch_calc_vm_flag_bits(flags);
>>   }
> 
> The mmap(2) manpage says
> 
>    This flag is currently a no-op on Linux.  However, by employing
>    this flag, applications can ensure that they transparently ob- tain
>    support if the flag is implemented in the future.  Thus, it is used
>    in the glibc threading implementation to allow for the fact that some
>    architectures may (later) require special treat- ment for stack
>    allocations.  A further reason to employ this flag is portability:
>    MAP_STACK exists (and has an effect) on some other systems (e.g.,
>    some of the BSDs).
> 
> so please propose an update for this?
> 

Just curious, why isn't MAP_STACK implemented in Linux kernel? what does 
it take to implement it?

Also, could there be other potential issue with the vma merge, such as, 
the user process start to truncate half of the anonymous memory vma 
range oblivious to the fact that the vma has 'grown' into its stack and 
it might be attempting to unmap some of its stack range?

If the vma merge is otherwise harmless, does it bring benefit other than 
being one vma less?

thanks!
-jane

Jane Chu April 20, 2023, midnight UTC | #12

On 4/19/2023 4:21 PM, Jane Chu wrote:
> On 4/18/2023 2:18 PM, Andrew Morton wrote:
>> On Tue, 18 Apr 2023 17:02:30 -0400 Waiman Long <longman@redhat.com> 
>> wrote:
> [..]
>>> ...
>>>
>>> --- a/include/linux/mman.h
>>> +++ b/include/linux/mman.h
>>> @@ -152,6 +152,7 @@ calc_vm_flag_bits(unsigned long flags)
>>>       return _calc_vm_trans(flags, MAP_GROWSDOWN,  VM_GROWSDOWN ) |
>>>              _calc_vm_trans(flags, MAP_LOCKED,     VM_LOCKED    ) |
>>>              _calc_vm_trans(flags, MAP_SYNC,         VM_SYNC      ) |
>>> +           _calc_vm_trans(flags, MAP_STACK,         VM_STACK     ) |
>>>              arch_calc_vm_flag_bits(flags);
>>>   }
>>
>> The mmap(2) manpage says
>>
>>    This flag is currently a no-op on Linux.  However, by employing
>>    this flag, applications can ensure that they transparently ob- tain
>>    support if the flag is implemented in the future.  Thus, it is used
>>    in the glibc threading implementation to allow for the fact that some
>>    architectures may (later) require special treat- ment for stack
>>    allocations.  A further reason to employ this flag is portability:
>>    MAP_STACK exists (and has an effect) on some other systems (e.g.,
>>    some of the BSDs).
>>
>> so please propose an update for this?
>>
> 
> Just curious, why isn't MAP_STACK implemented in Linux kernel? what does 
> it take to implement it?
> 
> Also, could there be other potential issue with the vma merge, such as, 
> the user process start to truncate half of the anonymous memory vma 
> range oblivious to the fact that the vma has 'grown' into its stack and 
> it might be attempting to unmap some of its stack range?

Sorry, not 'oblivious'. how about a malicious user process get an fd via 
memfd_create() and attempt to truncate more than it mmap'ed?

> 
> If the vma merge is otherwise harmless, does it bring benefit other than 
> being one vma less?
> 
> thanks!
> -jane
> 
>

mm/mmap: Map MAP_STACK to VM_STACK

Commit Message

Comments

Patch