diff mbox

Don't mlock guardpage if the stack is growing up

Message ID BANLkTim3kKJkisZK1mOgKhsEEs7FzZmyXA@mail.gmail.com (mailing list archive)
State Superseded
Headers show

Commit Message

Linus Torvalds May 9, 2011, 10:07 p.m. UTC
On Mon, May 9, 2011 at 8:57 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Hmm. One thing that strikes me is this problem also implies that the
> /proc/self/maps file is wrong for the GROWSUP case, isn't it?
>
> So I think we should not just apply your lock fix, but then *also*
> apply something like this:

Actually, I think we might be better off with something like this.

It makes a few more changes:

 - move the stack guard page checking in __get_user_pages() into the
rare case (ie we didn't find a page), since that's the only case we
care about (the thing about the guard page is that don't want to call
"handle_mm_fault()"). As a result, it's off any path where we can
possibly care about performance, so we might as well have a nice
helper function for both the grow-up and grow-down cases, instead of
trying to be clever and only look at the grow-down case for the first
page in the vma like you did in your patch.

   End result: simpler, more straightforward code.

 - Move the growsup/down helper functions to <linux/mm.h>, since the
/proc code really wants to use them too. That means that the
"vma_stack_continue()" function (which now got split up into two
cases, for the up/down cases) is now entirely just an internal helper
function - nobody else uses it, and the real interface are the
"stack_guard_page_xyz()"  functions. Renamed to be simpler.

 - changed that naming of those stack_guard_page functions to use
_start and _end instead of growsup/growsdown, since it actually takes
the start or the end of the page as the argument (to match the
semantics of the afore-mentioned helpers)

 - and finally, make /proc/<pid>/maps use these helpers for both the
up/down case, so now /proc/self/maps should work well for the growsup
case too.

Hmm?

The only oddish case is IA64 that actually has a stack that grows
*both* up and down. That means that I could make up a stack mapping
that has a single virtual page in it, that is both the start *and* the
end page. Now /proc/self/maps would actually show such a mapping with
"negative" size. That's interesting.

It would be easy enough to have a "if (end < start) end = start" there
for that case, but maybe it's actually interesting information.

Regardless, I'd like to hear whether this patch really does work on
PA-RISC and especially IA64. I think those are the only cases that
have a GROWSUP stack. And the IA64 case that supports both is the most
interesting, everybody else does just one or the other.

                    Linus

Comments

James Bottomley May 9, 2011, 10:19 p.m. UTC | #1
On Mon, 2011-05-09 at 15:07 -0700, Linus Torvalds wrote:
> Regardless, I'd like to hear whether this patch really does work on
> PA-RISC and especially IA64. I think those are the only cases that
> have a GROWSUP stack. And the IA64 case that supports both is the most
> interesting, everybody else does just one or the other.

So I can test the patch, if you tell me how.  I don't use lvm2, so it
would have to be a simple test case.

James


--
To unsubscribe from this list: send the line "unsubscribe linux-parisc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mikulas Patocka May 9, 2011, 10:26 p.m. UTC | #2
On Mon, 9 May 2011, Linus Torvalds wrote:

> On Mon, May 9, 2011 at 8:57 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Hmm. One thing that strikes me is this problem also implies that the
> > /proc/self/maps file is wrong for the GROWSUP case, isn't it?
> >
> > So I think we should not just apply your lock fix, but then *also*
> > apply something like this:
> 
> Actually, I think we might be better off with something like this.
> 
> It makes a few more changes:
> 
>  - move the stack guard page checking in __get_user_pages() into the
> rare case (ie we didn't find a page), since that's the only case we
> care about (the thing about the guard page is that don't want to call
> "handle_mm_fault()"). As a result, it's off any path where we can
> possibly care about performance, so we might as well have a nice
> helper function for both the grow-up and grow-down cases, instead of
> trying to be clever and only look at the grow-down case for the first
> page in the vma like you did in your patch.
> 
>    End result: simpler, more straightforward code.
> 
>  - Move the growsup/down helper functions to <linux/mm.h>, since the
> /proc code really wants to use them too. That means that the
> "vma_stack_continue()" function (which now got split up into two
> cases, for the up/down cases) is now entirely just an internal helper
> function - nobody else uses it, and the real interface are the
> "stack_guard_page_xyz()"  functions. Renamed to be simpler.
> 
>  - changed that naming of those stack_guard_page functions to use
> _start and _end instead of growsup/growsdown, since it actually takes
> the start or the end of the page as the argument (to match the
> semantics of the afore-mentioned helpers)
> 
>  - and finally, make /proc/<pid>/maps use these helpers for both the
> up/down case, so now /proc/self/maps should work well for the growsup
> case too.
> 
> Hmm?
> 
> The only oddish case is IA64 that actually has a stack that grows
> *both* up and down. That means that I could make up a stack mapping
> that has a single virtual page in it, that is both the start *and* the
> end page. Now /proc/self/maps would actually show such a mapping with
> "negative" size. That's interesting.
> 
> It would be easy enough to have a "if (end < start) end = start" there
> for that case, but maybe it's actually interesting information.
> 
> Regardless, I'd like to hear whether this patch really does work on
> PA-RISC and especially IA64. I think those are the only cases that
> have a GROWSUP stack. And the IA64 case that supports both is the most
> interesting, everybody else does just one or the other.
> 
>                     Linus

I will test it after a week, now I'm traveling away.

Mikulas
--
To unsubscribe from this list: send the line "unsubscribe linux-parisc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Linus Torvalds May 9, 2011, 10:31 p.m. UTC | #3
On Mon, May 9, 2011 at 3:19 PM, James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
>
> So I can test the patch, if you tell me how.  I don't use lvm2, so it
> would have to be a simple test case.

Something like the attached. Run it as root (it needs root just for
the mlockall), and see whether the stack it shows changes.

With current kernels, I think the stack expands by one page during the
mlockall (for STACK_GROWSUP), with the patch it shouldn't.

                   Linus
Tony Luck May 9, 2011, 10:53 p.m. UTC | #4
On Mon, May 9, 2011 at 3:31 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> With current kernels, I think the stack expands by one page during the
> mlockall (for STACK_GROWSUP), with the patch it shouldn't.

Tried this on ia64 (with a mod because the upward growing stack isn't
blessed with
the [stack] annotation, only the downward growing stack gets that honour).

ia64 builds, boots, and processes can still grow stacks (both of them).  The
patched kernel doesn't change the size of the upwardly growing stack across
the mlockall().

-Tony

P.S. while we could start both stacks on the same page and have the grow
away from the start point, ia64 actually starts them out a fair distance apart
and lets them run into each other (if you have enough memory to let them
grow that far, and if ulimit -s doesn't stop them earlier)
--
To unsubscribe from this list: send the line "unsubscribe linux-parisc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Linus Torvalds May 9, 2011, 10:58 p.m. UTC | #5
On Mon, May 9, 2011 at 3:53 PM, Tony Luck <tony.luck@gmail.com> wrote:
>
> P.S. while we could start both stacks on the same page and have the grow
> away from the start point, ia64 actually starts them out a fair distance apart
> and lets them run into each other (if you have enough memory to let them
> grow that far, and if ulimit -s doesn't stop them earlier)

Ahh, so you never actually have one single mapping that has both flags set?

In that case, I won't even worry about it.

One thing I did want to verify: did the mlockall() actually change the
stack size without that patch? Just to double-check that the patch
actually did change semantics visibly.

                 Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-parisc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tony Luck May 9, 2011, 11:08 p.m. UTC | #6
On Mon, May 9, 2011 at 3:58 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> Ahh, so you never actually have one single mapping that has both flags set?
>
> In that case, I won't even worry about it.

Definitely not for normal processes - I'm not sure how both stacks are
set up for threads.

> One thing I did want to verify: did the mlockall() actually change the
> stack size without that patch? Just to double-check that the patch
> actually did change semantics visibly.

On an unpatched system I see this (lots more than one page of growth -
pages are 64K on this config):
6007fffffff50000-6007fffffff70000 rw-p 00000000 00:00 0
6007fffffff50000-6008000000750000 rw-p 00000000 00:00 0

On a patched system I see (this one has 16K pages - no growth)
600007ffff9d0000-600007ffff9d4000 rw-p 00000000 00:00 0
600007ffff9d0000-600007ffff9d4000 rw-p 00000000 00:00 0

-Tony
--
To unsubscribe from this list: send the line "unsubscribe linux-parisc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Linus Torvalds May 9, 2011, 11:17 p.m. UTC | #7
On Mon, May 9, 2011 at 4:08 PM, Tony Luck <tony.luck@gmail.com> wrote:
>
> Definitely not for normal processes - I'm not sure how both stacks are
> set up for threads.

We don't actually allow user space to set the growsup/growsdown bits
any more (we have PROT_GROWSUP and PROT_GROWSDOWN, but that is to
allow mprotect to not give an exact range, but say "apply this to the
end of a growsup/growsdown segment").

So the only thing that has those bits are things that the kernel sets
explicitly at exec time. So if ia64 doesn't set it, we're all good.

>> One thing I did want to verify: did the mlockall() actually change the
>> stack size without that patch? Just to double-check that the patch
>> actually did change semantics visibly.
>
> On an unpatched system I see this (lots more than one page of growth -
> pages are 64K on this config):
> 6007fffffff50000-6007fffffff70000 rw-p 00000000 00:00 0
> 6007fffffff50000-6008000000750000 rw-p 00000000 00:00 0
>
> On a patched system I see (this one has 16K pages - no growth)
> 600007ffff9d0000-600007ffff9d4000 rw-p 00000000 00:00 0
> 600007ffff9d0000-600007ffff9d4000 rw-p 00000000 00:00 0

Ok, I'll consider it tested. I'll commit it with Mikulas as author,
but note that I edited it so he won't get the blame if there's some
problem.

                             Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-parisc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Linus Torvalds May 9, 2011, 11:25 p.m. UTC | #8
On Mon, May 9, 2011 at 4:17 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Ok, I'll consider it tested. I'll commit it with Mikulas as author,
> but note that I edited it so he won't get the blame if there's some
> problem.

Oh, and I marked it for stable too, although I don't know if any
distribution really cares about parisc or ia64. And I'm not sure that
ia64 even saw the lvm2 failure case - I'd have expected to hear about
it if it actually happens there.

                       Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-parisc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
James Bottomley May 10, 2011, 4:12 a.m. UTC | #9
On Mon, 2011-05-09 at 16:25 -0700, Linus Torvalds wrote:
> On Mon, May 9, 2011 at 4:17 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Ok, I'll consider it tested. I'll commit it with Mikulas as author,
> > but note that I edited it so he won't get the blame if there's some
> > problem.
> 
> Oh, and I marked it for stable too, although I don't know if any
> distribution really cares about parisc or ia64. And I'm not sure that
> ia64 even saw the lvm2 failure case - I'd have expected to hear about
> it if it actually happens there.

Well, it's a done deal, but here's the proof on parisc too:

Before:

c0266000-c0289000 rwxp 00000000 00:00 0                                  [stack]
c0266000-c0a66000 rwxp 00000000 00:00 0                                  [stack]

After:

bffee000-c0010000 rwxp 00000000 00:00 0                                  [stack]
bffee000-c0010000 rwxp 00000000 00:00 0                                  [stack]

James


--
To unsubscribe from this list: send the line "unsubscribe linux-parisc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mikulas Patocka May 15, 2011, 10:18 p.m. UTC | #10
On Tue, 10 May 2011, Mikulas Patocka wrote:

> On Mon, 9 May 2011, Linus Torvalds wrote:
> 
> > On Mon, May 9, 2011 at 8:57 AM, Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> > >
> > > Hmm. One thing that strikes me is this problem also implies that the
> > > /proc/self/maps file is wrong for the GROWSUP case, isn't it?
> > >
> > > So I think we should not just apply your lock fix, but then *also*
> > > apply something like this:
> > 
> > Actually, I think we might be better off with something like this.
> > 
> > It makes a few more changes:
> > 
> >  - move the stack guard page checking in __get_user_pages() into the
> > rare case (ie we didn't find a page), since that's the only case we
> > care about (the thing about the guard page is that don't want to call
> > "handle_mm_fault()"). As a result, it's off any path where we can
> > possibly care about performance, so we might as well have a nice
> > helper function for both the grow-up and grow-down cases, instead of
> > trying to be clever and only look at the grow-down case for the first
> > page in the vma like you did in your patch.
> > 
> >    End result: simpler, more straightforward code.
> > 
> >  - Move the growsup/down helper functions to <linux/mm.h>, since the
> > /proc code really wants to use them too. That means that the
> > "vma_stack_continue()" function (which now got split up into two
> > cases, for the up/down cases) is now entirely just an internal helper
> > function - nobody else uses it, and the real interface are the
> > "stack_guard_page_xyz()"  functions. Renamed to be simpler.
> > 
> >  - changed that naming of those stack_guard_page functions to use
> > _start and _end instead of growsup/growsdown, since it actually takes
> > the start or the end of the page as the argument (to match the
> > semantics of the afore-mentioned helpers)
> > 
> >  - and finally, make /proc/<pid>/maps use these helpers for both the
> > up/down case, so now /proc/self/maps should work well for the growsup
> > case too.
> > 
> > Hmm?
> > 
> > The only oddish case is IA64 that actually has a stack that grows
> > *both* up and down. That means that I could make up a stack mapping
> > that has a single virtual page in it, that is both the start *and* the
> > end page. Now /proc/self/maps would actually show such a mapping with
> > "negative" size. That's interesting.
> > 
> > It would be easy enough to have a "if (end < start) end = start" there
> > for that case, but maybe it's actually interesting information.
> > 
> > Regardless, I'd like to hear whether this patch really does work on
> > PA-RISC and especially IA64. I think those are the only cases that
> > have a GROWSUP stack. And the IA64 case that supports both is the most
> > interesting, everybody else does just one or the other.
> > 
> >                     Linus
> 
> I will test it after a week, now I'm traveling away.
> 
> Mikulas

Hi

I tested 2.6.39-rc7 in on PA-RISC and confirm that it works.

Mikulas
--
To unsubscribe from this list: send the line "unsubscribe linux-parisc" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

 fs/proc/task_mmu.c |   12 +++++++-----
 include/linux/mm.h |   24 +++++++++++++++++++++++-
 mm/memory.c        |   16 +++++++---------
 3 files changed, 37 insertions(+), 15 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 2e7addfd9803..318d8654989b 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -214,7 +214,7 @@  static void show_map_vma(struct seq_file *m, struct vm_area_struct *vma)
 	int flags = vma->vm_flags;
 	unsigned long ino = 0;
 	unsigned long long pgoff = 0;
-	unsigned long start;
+	unsigned long start, end;
 	dev_t dev = 0;
 	int len;
 
@@ -227,13 +227,15 @@  static void show_map_vma(struct seq_file *m, struct vm_area_struct *vma)
 
 	/* We don't show the stack guard page in /proc/maps */
 	start = vma->vm_start;
-	if (vma->vm_flags & VM_GROWSDOWN)
-		if (!vma_stack_continue(vma->vm_prev, vma->vm_start))
-			start += PAGE_SIZE;
+	if (stack_guard_page_start(vma, start))
+		start += PAGE_SIZE;
+	end = vma->vm_end;
+	if (stack_guard_page_end(vma, end))
+		end -= PAGE_SIZE;
 
 	seq_printf(m, "%08lx-%08lx %c%c%c%c %08llx %02x:%02x %lu %n",
 			start,
-			vma->vm_end,
+			end,
 			flags & VM_READ ? 'r' : '-',
 			flags & VM_WRITE ? 'w' : '-',
 			flags & VM_EXEC ? 'x' : '-',
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2348db26bc3d..6507dde38b16 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1011,11 +1011,33 @@  int set_page_dirty_lock(struct page *page);
 int clear_page_dirty_for_io(struct page *page);
 
 /* Is the vma a continuation of the stack vma above it? */
-static inline int vma_stack_continue(struct vm_area_struct *vma, unsigned long addr)
+static inline int vma_growsdown(struct vm_area_struct *vma, unsigned long addr)
 {
 	return vma && (vma->vm_end == addr) && (vma->vm_flags & VM_GROWSDOWN);
 }
 
+static inline int stack_guard_page_start(struct vm_area_struct *vma,
+					     unsigned long addr)
+{
+	return (vma->vm_flags & VM_GROWSDOWN) &&
+		(vma->vm_start == addr) &&
+		!vma_growsdown(vma->vm_prev, addr);
+}
+
+/* Is the vma a continuation of the stack vma below it? */
+static inline int vma_growsup(struct vm_area_struct *vma, unsigned long addr)
+{
+	return vma && (vma->vm_start == addr) && (vma->vm_flags & VM_GROWSUP);
+}
+
+static inline int stack_guard_page_end(struct vm_area_struct *vma,
+					   unsigned long addr)
+{
+	return (vma->vm_flags & VM_GROWSUP) &&
+		(vma->vm_end == addr) &&
+		!vma_growsup(vma->vm_next, addr);
+}
+
 extern unsigned long move_page_tables(struct vm_area_struct *vma,
 		unsigned long old_addr, struct vm_area_struct *new_vma,
 		unsigned long new_addr, unsigned long len);
diff --git a/mm/memory.c b/mm/memory.c
index 27f425378112..61e66f026563 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1412,9 +1412,8 @@  no_page_table:
 
 static inline int stack_guard_page(struct vm_area_struct *vma, unsigned long addr)
 {
-	return (vma->vm_flags & VM_GROWSDOWN) &&
-		(vma->vm_start == addr) &&
-		!vma_stack_continue(vma->vm_prev, addr);
+	return stack_guard_page_start(vma, addr) ||
+	       stack_guard_page_end(vma, addr+PAGE_SIZE);
 }
 
 /**
@@ -1551,12 +1550,6 @@  int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			continue;
 		}
 
-		/*
-		 * For mlock, just skip the stack guard page.
-		 */
-		if ((gup_flags & FOLL_MLOCK) && stack_guard_page(vma, start))
-			goto next_page;
-
 		do {
 			struct page *page;
 			unsigned int foll_flags = gup_flags;
@@ -1573,6 +1566,11 @@  int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 				int ret;
 				unsigned int fault_flags = 0;
 
+				/* For mlock, just skip the stack guard page. */
+				if (foll_flags & FOLL_MLOCK) {
+					if (stack_guard_page(vma, start))
+						goto next_page;
+				}
 				if (foll_flags & FOLL_WRITE)
 					fault_flags |= FAULT_FLAG_WRITE;
 				if (nonblocking)