From patchwork Mon May 9 22:07:41 2011
From: Linus Torvalds
Date: Mon, 9 May 2011 15:07:41 -0700
Subject: Re: [PATCH] Don't mlock guardpage if the stack is growing up
To: Mikulas Patocka, Tony Luck, Fenghua Yu
Cc: Hugh Dickins, linux-kernel@vger.kernel.org, linux-parisc@vger.kernel.org,
    Michel Lespinasse, Oleg Nesterov, linux-ia64@vger.kernel.org
List-ID: linux-parisc@vger.kernel.org
On Mon, May 9, 2011 at 8:57 AM, Linus Torvalds wrote:
>
> Hmm. One thing that strikes me is this problem also implies that the
> /proc/self/maps file is wrong for the GROWSUP case, isn't it?
>
> So I think we should not just apply your lock fix, but then *also*
> apply something like this:

Actually, I think we might be better off with something like this. It
makes a few more changes:

 - Move the stack guard page checking in __get_user_pages() into the
   rare case (i.e. we didn't find a page), since that's the only case we
   care about (the point of the guard page is that we don't want to call
   "handle_mm_fault()"). As a result, it's off any path where we could
   possibly care about performance, so we might as well have a nice
   helper function for both the grow-up and grow-down cases, instead of
   trying to be clever and only looking at the grow-down case for the
   first page in the vma, as your patch did. End result: simpler, more
   straightforward code.

 - Move the growsup/growsdown helper functions to <linux/mm.h>, since
   the /proc code really wants to use them too. That means that the
   "vma_stack_continue()" function (which has now been split into two,
   for the up and down cases) is entirely an internal helper - nobody
   else uses it, and the real interface is the "stack_guard_page_xyz()"
   functions. Renamed to be simpler.

 - Change the naming of those stack_guard_page functions to use _start
   and _end instead of growsup/growsdown, since each actually takes the
   start or the end of the page as its argument (to match the semantics
   of the aforementioned helpers).

 - Finally, make /proc/<pid>/maps use these helpers for both the up and
   down cases, so /proc/self/maps should now work for the growsup case
   too.

Hmm?
The only oddish case is IA64, which actually has a stack that grows
*both* up and down. That means that I could make up a stack mapping
that has a single virtual page in it, which is both the start *and* the
end page. Now /proc/self/maps would actually show such a mapping with
"negative" size. That's interesting. It would be easy enough to have an
"if (end < start) end = start" there for that case, but maybe it's
actually interesting information.

Regardless, I'd like to hear whether this patch really does work on
PA-RISC and especially on IA64. I think those are the only cases that
have a GROWSUP stack - and the IA64 case that supports both directions
is the most interesting one, since everybody else does just one or the
other.

                     Linus

---
 fs/proc/task_mmu.c | 12 +++++++-----
 include/linux/mm.h | 24 +++++++++++++++++++++++-
 mm/memory.c        | 16 +++++++---------
 3 files changed, 37 insertions(+), 15 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 2e7addfd9803..318d8654989b 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -214,7 +214,7 @@ static void show_map_vma(struct seq_file *m, struct vm_area_struct *vma)
 	int flags = vma->vm_flags;
 	unsigned long ino = 0;
 	unsigned long long pgoff = 0;
-	unsigned long start;
+	unsigned long start, end;
 	dev_t dev = 0;
 	int len;
 
@@ -227,13 +227,15 @@ static void show_map_vma(struct seq_file *m, struct vm_area_struct *vma)
 
 	/* We don't show the stack guard page in /proc/maps */
 	start = vma->vm_start;
-	if (vma->vm_flags & VM_GROWSDOWN)
-		if (!vma_stack_continue(vma->vm_prev, vma->vm_start))
-			start += PAGE_SIZE;
+	if (stack_guard_page_start(vma, start))
+		start += PAGE_SIZE;
+	end = vma->vm_end;
+	if (stack_guard_page_end(vma, end))
+		end -= PAGE_SIZE;
 
 	seq_printf(m, "%08lx-%08lx %c%c%c%c %08llx %02x:%02x %lu %n",
 			start,
-			vma->vm_end,
+			end,
 			flags & VM_READ ? 'r' : '-',
 			flags & VM_WRITE ? 'w' : '-',
 			flags & VM_EXEC ? 'x' : '-',
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2348db26bc3d..6507dde38b16 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1011,11 +1011,33 @@ int set_page_dirty_lock(struct page *page);
 int clear_page_dirty_for_io(struct page *page);
 
 /* Is the vma a continuation of the stack vma above it? */
-static inline int vma_stack_continue(struct vm_area_struct *vma, unsigned long addr)
+static inline int vma_growsdown(struct vm_area_struct *vma, unsigned long addr)
 {
 	return vma && (vma->vm_end == addr) && (vma->vm_flags & VM_GROWSDOWN);
 }
 
+static inline int stack_guard_page_start(struct vm_area_struct *vma,
+					     unsigned long addr)
+{
+	return (vma->vm_flags & VM_GROWSDOWN) &&
+		(vma->vm_start == addr) &&
+		!vma_growsdown(vma->vm_prev, addr);
+}
+
+/* Is the vma a continuation of the stack vma below it? */
+static inline int vma_growsup(struct vm_area_struct *vma, unsigned long addr)
+{
+	return vma && (vma->vm_start == addr) && (vma->vm_flags & VM_GROWSUP);
+}
+
+static inline int stack_guard_page_end(struct vm_area_struct *vma,
+					   unsigned long addr)
+{
+	return (vma->vm_flags & VM_GROWSUP) &&
+		(vma->vm_end == addr) &&
+		!vma_growsup(vma->vm_next, addr);
+}
+
 extern unsigned long move_page_tables(struct vm_area_struct *vma,
 		unsigned long old_addr, struct vm_area_struct *new_vma,
 		unsigned long new_addr, unsigned long len);
diff --git a/mm/memory.c b/mm/memory.c
index 27f425378112..61e66f026563 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1412,9 +1412,8 @@ no_page_table:
 
 static inline int stack_guard_page(struct vm_area_struct *vma, unsigned long addr)
 {
-	return (vma->vm_flags & VM_GROWSDOWN) &&
-		(vma->vm_start == addr) &&
-		!vma_stack_continue(vma->vm_prev, addr);
+	return stack_guard_page_start(vma, addr) ||
+		stack_guard_page_end(vma, addr+PAGE_SIZE);
 }
 
 /**
@@ -1551,12 +1550,6 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			continue;
 		}
 
-		/*
-		 * For mlock, just skip the stack guard page.
-		 */
-		if ((gup_flags & FOLL_MLOCK) && stack_guard_page(vma, start))
-			goto next_page;
-
 		do {
 			struct page *page;
 			unsigned int foll_flags = gup_flags;
@@ -1573,6 +1566,11 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 				int ret;
 				unsigned int fault_flags = 0;
 
+				/* For mlock, just skip the stack guard page. */
+				if (foll_flags & FOLL_MLOCK) {
+					if (stack_guard_page(vma, start))
+						goto next_page;
+				}
 				if (foll_flags & FOLL_WRITE)
 					fault_flags |= FAULT_FLAG_WRITE;
 				if (nonblocking)