From patchwork Tue Sep 1 16:14:57 2020
X-Patchwork-Submitter: Sumit Semwal
X-Patchwork-Id: 11749027
From: Sumit Semwal
To: Andrew Morton, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Alexey Dobriyan, Jonathan Corbet
Cc: Mauro Carvalho Chehab, Kees Cook, Michal Hocko, Colin Cross, Alexey Gladkov, Matthew Wilcox, Jason Gunthorpe,
    "Kirill A. Shutemov", Michel Lespinasse, Michal Koutný, Song Liu, Huang Ying, Vlastimil Babka, Yang Shi,
    chenqiwu, Mathieu Desnoyers, John Hubbard, Mike Christie, Bart Van Assche, Amit Pundir, Thomas Gleixner,
    Christian Brauner, Daniel Jordan, Adrian Reber, Nicolas Viennot, Al Viro, linux-fsdevel@vger.kernel.org,
    John Stultz, Pekka Enberg, Dave Hansen, Peter Zijlstra, Ingo Molnar, Oleg Nesterov, "Eric W. Biederman",
    Jan Glauber, Rob Landley, Cyrill Gorcunov, "Serge E. Hallyn", David Rientjes, Hugh Dickins, Rik van Riel,
    Mel Gorman, Tang Chen, Robin Holt, Shaohua Li, Sasha Levin, Johannes Weiner, Minchan Kim, Sumit Semwal
Subject: [PATCH v7 1/3] mm: rearrange madvise code to allow for reuse
Date: Tue, 1 Sep 2020 21:44:57 +0530
Message-Id: <20200901161459.11772-2-sumit.semwal@linaro.org>
In-Reply-To: <20200901161459.11772-1-sumit.semwal@linaro.org>
References: <20200901161459.11772-1-sumit.semwal@linaro.org>

From: Colin Cross

Refactor the madvise syscall to allow for parts of it to be reused by a
prctl syscall that affects vmas.

Move the code that walks vmas in a virtual address range into a
function that takes a function pointer as a parameter.
The only caller for now is sys_madvise, which uses it to call
madvise_vma_behavior on each vma, but the next patch will add an
additional caller.

Move handling all vma behaviors inside madvise_behavior, and rename it
to madvise_vma_behavior.

Move the code that updates the flags on a vma, including splitting or
merging the vma as necessary, into a new function called
madvise_update_vma. The next patch will add support for updating a new
anon_name field as well.

Signed-off-by: Colin Cross
Cc: Pekka Enberg
Cc: Dave Hansen
Cc: Peter Zijlstra
Cc: Ingo Molnar
Cc: Oleg Nesterov
Cc: "Eric W. Biederman"
Cc: Jan Glauber
Cc: John Stultz
Cc: Rob Landley
Cc: Cyrill Gorcunov
Cc: Kees Cook
Cc: "Serge E. Hallyn"
Cc: David Rientjes
Cc: Al Viro
Cc: Hugh Dickins
Cc: Rik van Riel
Cc: Mel Gorman
Cc: Michel Lespinasse
Cc: Tang Chen
Cc: Robin Holt
Cc: Shaohua Li
Cc: Sasha Levin
Cc: Johannes Weiner
Cc: Minchan Kim
Signed-off-by: Andrew Morton
[sumits: rebased over v5.9-rc3]
Signed-off-by: Sumit Semwal
---
 mm/madvise.c | 312 +++++++++++++++++++++++++++------------------------
 1 file changed, 168 insertions(+), 144 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index dd1d43cf026d..84482c21b029 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -60,76 +60,20 @@ static int madvise_need_mmap_write(int behavior)
 }
 
 /*
- * We can potentially split a vm area into separate
- * areas, each area with its own behavior.
+ * Update the vm_flags on a region of a vma, splitting it or merging it as
+ * necessary. Must be called with mmap_lock held for writing.
  */
-static long madvise_behavior(struct vm_area_struct *vma,
-		struct vm_area_struct **prev,
-		unsigned long start, unsigned long end, int behavior)
+static int madvise_update_vma(struct vm_area_struct *vma,
+			      struct vm_area_struct **prev, unsigned long start,
+			      unsigned long end, unsigned long new_flags)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	int error = 0;
+	int error;
 	pgoff_t pgoff;
-	unsigned long new_flags = vma->vm_flags;
-
-	switch (behavior) {
-	case MADV_NORMAL:
-		new_flags = new_flags & ~VM_RAND_READ & ~VM_SEQ_READ;
-		break;
-	case MADV_SEQUENTIAL:
-		new_flags = (new_flags & ~VM_RAND_READ) | VM_SEQ_READ;
-		break;
-	case MADV_RANDOM:
-		new_flags = (new_flags & ~VM_SEQ_READ) | VM_RAND_READ;
-		break;
-	case MADV_DONTFORK:
-		new_flags |= VM_DONTCOPY;
-		break;
-	case MADV_DOFORK:
-		if (vma->vm_flags & VM_IO) {
-			error = -EINVAL;
-			goto out;
-		}
-		new_flags &= ~VM_DONTCOPY;
-		break;
-	case MADV_WIPEONFORK:
-		/* MADV_WIPEONFORK is only supported on anonymous memory. */
-		if (vma->vm_file || vma->vm_flags & VM_SHARED) {
-			error = -EINVAL;
-			goto out;
-		}
-		new_flags |= VM_WIPEONFORK;
-		break;
-	case MADV_KEEPONFORK:
-		new_flags &= ~VM_WIPEONFORK;
-		break;
-	case MADV_DONTDUMP:
-		new_flags |= VM_DONTDUMP;
-		break;
-	case MADV_DODUMP:
-		if (!is_vm_hugetlb_page(vma) && new_flags & VM_SPECIAL) {
-			error = -EINVAL;
-			goto out;
-		}
-		new_flags &= ~VM_DONTDUMP;
-		break;
-	case MADV_MERGEABLE:
-	case MADV_UNMERGEABLE:
-		error = ksm_madvise(vma, start, end, behavior, &new_flags);
-		if (error)
-			goto out_convert_errno;
-		break;
-	case MADV_HUGEPAGE:
-	case MADV_NOHUGEPAGE:
-		error = hugepage_madvise(vma, &new_flags, behavior);
-		if (error)
-			goto out_convert_errno;
-		break;
-	}
 
 	if (new_flags == vma->vm_flags) {
 		*prev = vma;
-		goto out;
+		return 0;
 	}
 
 	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
@@ -146,21 +90,21 @@ static long madvise_behavior(struct vm_area_struct *vma,
 	if (start != vma->vm_start) {
 		if (unlikely(mm->map_count >= sysctl_max_map_count)) {
 			error = -ENOMEM;
-			goto out;
+			return error;
 		}
 		error = __split_vma(mm, vma, start, 1);
 		if (error)
-			goto out_convert_errno;
+			return error;
 	}
 
 	if (end != vma->vm_end) {
 		if (unlikely(mm->map_count >= sysctl_max_map_count)) {
 			error = -ENOMEM;
-			goto out;
+			return error;
 		}
 		error = __split_vma(mm, vma, end, 0);
 		if (error)
-			goto out_convert_errno;
+			return error;
 	}
 
 success:
@@ -169,15 +113,7 @@ static long madvise_behavior(struct vm_area_struct *vma,
 	 */
 	vma->vm_flags = new_flags;
 
-out_convert_errno:
-	/*
-	 * madvise() returns EAGAIN if kernel resources, such as
-	 * slab, are temporarily unavailable.
-	 */
-	if (error == -ENOMEM)
-		error = -EAGAIN;
-out:
-	return error;
+	return 0;
 }
 
 #ifdef CONFIG_SWAP
@@ -862,6 +798,93 @@ static long madvise_remove(struct vm_area_struct *vma,
 	return error;
 }
 
+/*
+ * Apply an madvise behavior to a region of a vma. madvise_update_vma
+ * will handle splitting a vm area into separate areas, each area with its own
+ * behavior.
+ */
+static int madvise_vma_behavior(struct vm_area_struct *vma,
+				struct vm_area_struct **prev,
+				unsigned long start, unsigned long end,
+				unsigned long behavior)
+{
+	int error = 0;
+	unsigned long new_flags = vma->vm_flags;
+
+	switch (behavior) {
+	case MADV_REMOVE:
+		return madvise_remove(vma, prev, start, end);
+	case MADV_WILLNEED:
+		return madvise_willneed(vma, prev, start, end);
+	case MADV_COLD:
+		return madvise_cold(vma, prev, start, end);
+	case MADV_PAGEOUT:
+		return madvise_pageout(vma, prev, start, end);
+	case MADV_FREE:
+	case MADV_DONTNEED:
+		return madvise_dontneed_free(vma, prev, start, end, behavior);
+	case MADV_NORMAL:
+		new_flags = new_flags & ~VM_RAND_READ & ~VM_SEQ_READ;
+		break;
+	case MADV_SEQUENTIAL:
+		new_flags = (new_flags & ~VM_RAND_READ) | VM_SEQ_READ;
+		break;
+	case MADV_RANDOM:
+		new_flags = (new_flags & ~VM_SEQ_READ) | VM_RAND_READ;
+		break;
+	case MADV_DONTFORK:
+		new_flags |= VM_DONTCOPY;
+		break;
+	case MADV_DOFORK:
+		if (vma->vm_flags & VM_IO) {
+			error = -EINVAL;
+			goto out;
+		}
+		new_flags &= ~VM_DONTCOPY;
+		break;
+	case MADV_WIPEONFORK:
+		/* MADV_WIPEONFORK is only supported on anonymous memory. */
+		if (vma->vm_file || vma->vm_flags & VM_SHARED) {
+			error = -EINVAL;
+			goto out;
+		}
+		new_flags |= VM_WIPEONFORK;
+		break;
+	case MADV_KEEPONFORK:
+		new_flags &= ~VM_WIPEONFORK;
+		break;
+	case MADV_DONTDUMP:
+		new_flags |= VM_DONTDUMP;
+		break;
+	case MADV_DODUMP:
+		if (!is_vm_hugetlb_page(vma) && new_flags & VM_SPECIAL) {
+			error = -EINVAL;
+			goto out;
+		}
+		new_flags &= ~VM_DONTDUMP;
+		break;
+	case MADV_MERGEABLE:
+	case MADV_UNMERGEABLE:
+		error = ksm_madvise(vma, start, end, behavior, &new_flags);
+		if (error)
+			goto out;
+		break;
+	case MADV_HUGEPAGE:
+	case MADV_NOHUGEPAGE:
+		error = hugepage_madvise(vma, &new_flags, behavior);
+		if (error)
+			goto out;
+		break;
+	}
+
+	error = madvise_update_vma(vma, prev, start, end, new_flags);
+
+out:
+	if (error == -ENOMEM)
+		error = -EAGAIN;
+	return error;
+}
+
 #ifdef CONFIG_MEMORY_FAILURE
 /*
  * Error injection support for memory error handling.
@@ -931,27 +954,6 @@ static int madvise_inject_error(int behavior,
 }
 #endif
 
-static long
-madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
-	    unsigned long start, unsigned long end, int behavior)
-{
-	switch (behavior) {
-	case MADV_REMOVE:
-		return madvise_remove(vma, prev, start, end);
-	case MADV_WILLNEED:
-		return madvise_willneed(vma, prev, start, end);
-	case MADV_COLD:
-		return madvise_cold(vma, prev, start, end);
-	case MADV_PAGEOUT:
-		return madvise_pageout(vma, prev, start, end);
-	case MADV_FREE:
-	case MADV_DONTNEED:
-		return madvise_dontneed_free(vma, prev, start, end, behavior);
-	default:
-		return madvise_behavior(vma, prev, start, end, behavior);
-	}
-}
-
 static bool
 madvise_behavior_valid(int behavior)
 {
@@ -990,6 +992,73 @@ madvise_behavior_valid(int behavior)
 	}
 }
 
+/*
+ * Walk the vmas in range [start,end), and call the visit function on each one.
+ * The visit function will get start and end parameters that cover the overlap
+ * between the current vma and the original range. Any unmapped regions in the
+ * original range will result in this function returning -ENOMEM while still
+ * calling the visit function on all of the existing vmas in the range.
+ * Must be called with the mmap_lock held for reading or writing.
+ */
+static
+int madvise_walk_vmas(unsigned long start, unsigned long end,
+		      unsigned long arg,
+		      int (*visit)(struct vm_area_struct *vma,
+				   struct vm_area_struct **prev, unsigned long start,
+				   unsigned long end, unsigned long arg))
+{
+	struct vm_area_struct *vma;
+	struct vm_area_struct *prev;
+	unsigned long tmp;
+	int unmapped_error = 0;
+
+	/*
+	 * If the interval [start,end) covers some unmapped address
+	 * ranges, just ignore them, but return -ENOMEM at the end.
+	 * - different from the way of handling in mlock etc.
+	 */
+	vma = find_vma_prev(current->mm, start, &prev);
+	if (vma && start > vma->vm_start)
+		prev = vma;
+
+	for (;;) {
+		int error;
+
+		/* Still start < end. */
+		if (!vma)
+			return -ENOMEM;
+
+		/* Here start < (end|vma->vm_end). */
+		if (start < vma->vm_start) {
+			unmapped_error = -ENOMEM;
+			start = vma->vm_start;
+			if (start >= end)
+				break;
+		}
+
+		/* Here vma->vm_start <= start < (end|vma->vm_end) */
+		tmp = vma->vm_end;
+		if (end < tmp)
+			tmp = end;
+
+		/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
+		error = visit(vma, &prev, start, tmp, arg);
+		if (error)
+			return error;
+		start = tmp;
+		if (prev && start < prev->vm_end)
+			start = prev->vm_end;
+		if (start >= end)
+			break;
+		if (prev)
+			vma = prev->vm_next;
+		else	/* madvise_remove dropped mmap_lock */
+			vma = find_vma(current->mm, start);
+	}
+
+	return unmapped_error;
+}
+
 /*
  * The madvise(2) system call.
  *
@@ -1053,9 +1122,7 @@ madvise_behavior_valid(int behavior)
  */
 int do_madvise(unsigned long start, size_t len_in, int behavior)
 {
-	unsigned long end, tmp;
-	struct vm_area_struct *vma, *prev;
-	int unmapped_error = 0;
+	unsigned long end;
 	int error = -EINVAL;
 	int write;
 	size_t len;
@@ -1112,51 +1179,8 @@ int do_madvise(unsigned long start, size_t len_in, int behavior)
 		mmap_read_lock(current->mm);
 	}
 
-	/*
-	 * If the interval [start,end) covers some unmapped address
-	 * ranges, just ignore them, but return -ENOMEM at the end.
-	 * - different from the way of handling in mlock etc.
-	 */
-	vma = find_vma_prev(current->mm, start, &prev);
-	if (vma && start > vma->vm_start)
-		prev = vma;
-
 	blk_start_plug(&plug);
-	for (;;) {
-		/* Still start < end. */
-		error = -ENOMEM;
-		if (!vma)
-			goto out;
-
-		/* Here start < (end|vma->vm_end). */
-		if (start < vma->vm_start) {
-			unmapped_error = -ENOMEM;
-			start = vma->vm_start;
-			if (start >= end)
-				goto out;
-		}
-
-		/* Here vma->vm_start <= start < (end|vma->vm_end) */
-		tmp = vma->vm_end;
-		if (end < tmp)
-			tmp = end;
-
-		/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
-		error = madvise_vma(vma, &prev, start, tmp, behavior);
-		if (error)
-			goto out;
-		start = tmp;
-		if (prev && start < prev->vm_end)
-			start = prev->vm_end;
-		error = unmapped_error;
-		if (start >= end)
-			goto out;
-		if (prev)
-			vma = prev->vm_next;
-		else	/* madvise_remove dropped mmap_lock */
-			vma = find_vma(current->mm, start);
-	}
-out:
+	error = madvise_walk_vmas(start, end, behavior, madvise_vma_behavior);
 	blk_finish_plug(&plug);
 	if (write)
 		mmap_write_unlock(current->mm);
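The payoff of the function-pointer walker shows up in patch 3/3 of this
series, where a second caller reuses madvise_walk_vmas() with its own
visitor. A condensed sketch of that caller (not this patch's code: it
assumes the extra anon_name parameter that madvise_update_vma() only
grows in patch 3/3, and elides the page-alignment checks):

/* Sketch, condensed from patch 3/3: a second visitor for
 * madvise_walk_vmas(), driven from prctl() rather than sys_madvise. */
static int madvise_vma_anon_name(struct vm_area_struct *vma,
				 struct vm_area_struct **prev,
				 unsigned long start, unsigned long end,
				 unsigned long name_addr)
{
	int error;

	/* Only anonymous mappings can be named */
	if (vma->vm_file)
		return -EINVAL;

	error = madvise_update_vma(vma, prev, start, end, vma->vm_flags,
				   (const char __user *)name_addr);
	if (error == -ENOMEM)
		error = -EAGAIN;
	return error;
}

int madvise_set_anon_name(unsigned long start, unsigned long len_in,
			  unsigned long name_addr)
{
	/* page-align start/len and validate, as do_madvise() does (elided) */
	unsigned long end = start + len_in;

	return madvise_walk_vmas(start, end, name_addr, madvise_vma_anon_name);
}

The same walker thus serves both the behavior switch above and the
naming path, with the per-vma policy isolated in the visit callback.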
From patchwork Tue Sep 1 16:14:58 2020
X-Patchwork-Submitter: Sumit Semwal
X-Patchwork-Id: 11749029
From: Sumit Semwal
To: Andrew Morton, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Alexey Dobriyan, Jonathan Corbet
Cc: Mauro Carvalho Chehab, Kees Cook, Michal Hocko, Colin Cross, Alexey Gladkov, Matthew Wilcox, Jason Gunthorpe,
    "Kirill A. Shutemov", Michel Lespinasse, Michal Koutný, Song Liu, Huang Ying, Vlastimil Babka, Yang Shi,
    chenqiwu, Mathieu Desnoyers, John Hubbard, Mike Christie, Bart Van Assche, Amit Pundir, Thomas Gleixner,
    Christian Brauner, Daniel Jordan, Adrian Reber, Nicolas Viennot, Al Viro, linux-fsdevel@vger.kernel.org,
    John Stultz, Sumit Semwal
Subject: [PATCH v7 2/3] mm: memory: Add access_remote_vm_locked variant
Date: Tue, 1 Sep 2020 21:44:58 +0530
Message-Id: <20200901161459.11772-3-sumit.semwal@linaro.org>
In-Reply-To: <20200901161459.11772-1-sumit.semwal@linaro.org>
References: <20200901161459.11772-1-sumit.semwal@linaro.org>

This allows accessing a remote vm while the mmap_lock is already held
by the caller. While adding support for anonymous vma naming,
show_map_vma() needs to access the remote vm to get the name of the
anonymous vma. Since show_map_vma() already holds the mmap_lock, this
_locked variant is required.

Signed-off-by: Sumit Semwal
---
 include/linux/mm.h |  2 ++
 mm/memory.c        | 49 ++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 45 insertions(+), 6 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ca6e6a81576b..e9212c0bb5ac 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1708,6 +1708,8 @@ extern int access_remote_vm(struct mm_struct *mm, unsigned long addr,
 		void *buf, int len, unsigned int gup_flags);
 extern int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
 		unsigned long addr, void *buf, int len, unsigned int gup_flags);
+extern int access_remote_vm_locked(struct mm_struct *mm, unsigned long addr,
+		void *buf, int len, unsigned int gup_flags);
 
 long get_user_pages_remote(struct mm_struct *mm,
 		unsigned long start, unsigned long nr_pages,
diff --git a/mm/memory.c b/mm/memory.c
index 602f4283122f..207be99390e9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4726,17 +4726,17 @@ EXPORT_SYMBOL_GPL(generic_access_phys);
 /*
  * Access another process' address space as given in mm. If non-NULL, use the
  * given task for page fault accounting.
+ * This variant assumes that the mmap_lock is already held by the caller, so
+ * doesn't take the mmap_lock.
  */
-int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
-		unsigned long addr, void *buf, int len, unsigned int gup_flags)
+int __access_remote_vm_locked(struct task_struct *tsk, struct mm_struct *mm,
+			      unsigned long addr, void *buf, int len,
+			      unsigned int gup_flags)
 {
 	struct vm_area_struct *vma;
 	void *old_buf = buf;
 	int write = gup_flags & FOLL_WRITE;
 
-	if (mmap_read_lock_killable(mm))
-		return 0;
-
 	/* ignore errors, just check how much was successfully transferred */
 	while (len) {
 		int bytes, ret, offset;
@@ -4785,9 +4785,46 @@ int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
 		buf += bytes;
 		addr += bytes;
 	}
+	return buf - old_buf;
+}
+
+/*
+ * Access another process' address space as given in mm. If non-NULL, use the
+ * given task for page fault accounting.
+ */
+int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
+		unsigned long addr, void *buf, int len, unsigned int gup_flags)
+{
+	int ret;
+
+	if (mmap_read_lock_killable(mm))
+		return 0;
+
+	ret = __access_remote_vm_locked(tsk, mm, addr, buf, len, gup_flags);
 	mmap_read_unlock(mm);
 
-	return buf - old_buf;
+	return ret;
+}
+
+/**
+ * access_remote_vm_locked - access another process' address space, without
+ * taking the mmap_lock. This allows nested calls from callers that already
+ * have taken the lock.
+ *
+ * @mm:		the mm_struct of the target address space
+ * @addr:	start address to access
+ * @buf:	source or destination buffer
+ * @len:	number of bytes to transfer
+ * @gup_flags:	flags modifying lookup behaviour
+ *
+ * The caller must hold a reference on @mm, as well as hold the mmap_lock.
+ *
+ * Return: number of bytes copied from source to destination.
+ */
+int access_remote_vm_locked(struct mm_struct *mm, unsigned long addr, void *buf,
+		int len, unsigned int gup_flags)
+{
+	return __access_remote_vm_locked(NULL, mm, addr, buf, len, gup_flags);
 }
 
 /**
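The consumer of this helper arrives in patch 3/3: show_map_vma()
already holds mmap_lock when it prints a vma, so reading the name
string out of the remote process must not take the lock again. A
condensed view of that caller (taken from the patch 3/3 diff below,
shown here only to illustrate the intended use):

/* Condensed from patch 3/3 (fs/proc/task_mmu.c): reached from
 * show_map_vma(), which already holds mmap_lock, so the _locked
 * variant is required -- plain access_remote_vm() would deadlock
 * trying to take mmap_lock a second time. */
static void seq_print_vma_name(struct seq_file *m, struct vm_area_struct *vma)
{
	struct mm_struct *mm = vma->vm_mm;
	char anon_name[NAME_MAX + 1];
	int n;

	/* vma_anon_name() returns a userspace pointer into the target mm */
	n = access_remote_vm_locked(mm, (unsigned long)vma_anon_name(vma),
				    anon_name, NAME_MAX, 0);
	if (n > 0) {
		seq_puts(m, "[anon:");
		seq_write(m, anon_name, strnlen(anon_name, n));
		seq_putc(m, ']');
	}
}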
From patchwork Tue Sep 1 16:14:59 2020
X-Patchwork-Submitter: Sumit Semwal
X-Patchwork-Id: 11749033
From: Sumit Semwal
To: Andrew Morton, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Alexey Dobriyan, Jonathan Corbet
Cc: Mauro Carvalho Chehab, Kees Cook, Michal Hocko, Colin Cross, Alexey Gladkov, Matthew Wilcox, Jason Gunthorpe,
    "Kirill A. Shutemov", Michel Lespinasse, Michal Koutný, Song Liu, Huang Ying, Vlastimil Babka, Yang Shi,
    chenqiwu, Mathieu Desnoyers, John Hubbard, Mike Christie, Bart Van Assche, Amit Pundir, Thomas Gleixner,
    Christian Brauner, Daniel Jordan, Adrian Reber, Nicolas Viennot, Al Viro, linux-fsdevel@vger.kernel.org,
    John Stultz, Pekka Enberg, Dave Hansen, Peter Zijlstra, Ingo Molnar, Oleg Nesterov, "Eric W. Biederman",
    Jan Glauber, Rob Landley, Cyrill Gorcunov, "Serge E. Hallyn", David Rientjes, Hugh Dickins, Rik van Riel,
    Mel Gorman, Tang Chen, Robin Holt, Shaohua Li, Sasha Levin, Johannes Weiner, Minchan Kim, Sumit Semwal
Subject: [PATCH v7 3/3] mm: add a field to store names for private anonymous memory
Date: Tue, 1 Sep 2020 21:44:59 +0530
Message-Id: <20200901161459.11772-4-sumit.semwal@linaro.org>
In-Reply-To: <20200901161459.11772-1-sumit.semwal@linaro.org>
References: <20200901161459.11772-1-sumit.semwal@linaro.org>

From: Colin Cross

In many userspace applications, and especially in the VM-based
applications that Android uses heavily, there are multiple different
allocators in use. At a minimum there is libc malloc and the stack, and
in many cases there are libc malloc, the stack, direct syscalls to mmap
anonymous memory, and multiple VM heaps (one for small objects, one for
big objects, etc.). Each of these layers usually has its own tools to
inspect its usage; malloc by compiling a debug version, the VM through
heap inspection tools, and for direct syscalls there is usually no way
to track them.

On Android we heavily use a set of tools that use an extended version
of the logic covered in Documentation/vm/pagemap.txt to walk all pages
mapped in userspace and slice their usage by process, shared (COW) vs.
unique mappings, backing, etc. This can account for real physical
memory usage even in cases like fork without exec (which Android uses
heavily to share as many private COW pages as possible between
processes), Kernel SamePage Merging, and clean zero pages. It produces
a measurement of the pages that only exist in that process (USS, for
unique), and a measurement of the physical memory usage of that process
with the cost of shared pages being evenly split between processes that
share them (PSS).

If all anonymous memory is indistinguishable then figuring out the real
physical memory usage (PSS) of each heap requires either a pagemap
walking tool that can understand the heap debugging of every layer, or
for every layer's heap debugging tools to implement the pagemap walking
logic, in which case it is hard to get a consistent view of memory
across the whole system.

Tracking the information in userspace leads to all sorts of problems.
It either needs to be stored inside the process, which means every
process has to have an API to export its current heap information upon
request, or it has to be stored externally in a filesystem that
somebody needs to clean up on crashes. It needs to be readable while
the process is still running, so it has to have some sort of
synchronization with every layer of userspace.
Efficiently tracking the ranges requires reimplementing something like
the kernel vma trees, and linking to it from every layer of userspace.
It requires more memory, more syscalls, more runtime cost, and more
complexity to separately track regions that the kernel is already
tracking.

This patch adds a field to /proc/pid/maps and /proc/pid/smaps to show a
userspace-provided name for anonymous vmas. The names of named
anonymous vmas are shown in /proc/pid/maps and /proc/pid/smaps as
[anon:<name>].

Userspace can set the name for a region of memory by calling

   prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, start, len, (unsigned long)name);

Setting the name to NULL clears it.

The name is stored in a user pointer in the shared union in
vm_area_struct that points to a null terminated string inside the user
process. vmas that point to the same address and are otherwise
mergeable will be merged, but vmas that point to equivalent strings at
different addresses will not be merged.

The idea to store a userspace pointer to reduce the complexity within
mm (at the expense of the complexity of reading /proc/pid/mem) came
from Dave Hansen. This results in no runtime overhead in the mm
subsystem other than comparing the anon_name pointers when considering
vma merging. The pointer is stored in a union with fields that are only
used on file-backed mappings, so it does not increase memory usage.
(Upstream changed to remove the union, so this patch adds it back as
well)

Signed-off-by: Colin Cross
Cc: Pekka Enberg
Cc: Dave Hansen
Cc: Peter Zijlstra
Cc: Ingo Molnar
Cc: Oleg Nesterov
Cc: "Eric W. Biederman"
Cc: Jan Glauber
Cc: John Stultz
Cc: Rob Landley
Cc: Cyrill Gorcunov
Cc: Kees Cook
Cc: "Serge E. Hallyn"
Cc: David Rientjes
Cc: Al Viro
Cc: Hugh Dickins
Cc: Rik van Riel
Cc: Mel Gorman
Cc: Michel Lespinasse
Cc: Tang Chen
Cc: Robin Holt
Cc: Shaohua Li
Cc: Sasha Levin
Cc: Johannes Weiner
Cc: Minchan Kim
Signed-off-by: Andrew Morton
Signed-off-by: Sumit Semwal
---
v2: updates the commit message to explain in more detail why the patch
    is useful.
v3: renames vma_get_anon_name to vma_anon_name
    replaces logic in seq_print_vma_name with access_process_vm
    removes Name: entry from smaps, it's already on the header line
    changes the prctl option number to match what is currently in use
    on Android
v4: adds paragraph to commit log on why this is better than tracking in
    userspace
    squashes fixes from Andrew Morton to fix build error and warning
    fix build error reported by Mark Salter when !CONFIG_MMU
v5: rebased to v5.9-rc1, added minor fixes to match upstream
v6: rebased to v5.9-rc3, and addressed review comments:
    - added missing callers in fs/userfaultfd.c
    - simplified the union
    - use the new access_remote_vm_locked() in show_map_vma() since
      that already holds mmap_lock
v7: fixed randconfig build failure when CONFIG_ADVISE_SYSCALLS isn't
    defined
---
 Documentation/filesystems/proc.rst |  2 ++
 fs/proc/task_mmu.c                 | 24 ++++++++++++-
 fs/userfaultfd.c                   |  7 ++--
 include/linux/mm.h                 | 12 ++++++-
 include/linux/mm_types.h           | 25 +++++++++++---
 include/uapi/linux/prctl.h         |  3 ++
 kernel/sys.c                       | 32 ++++++++++++++++++
 mm/interval_tree.c                 |  2 +-
 mm/madvise.c                       | 54 +++++++++++++++++++++++++++---
 mm/mempolicy.c                     |  3 +-
 mm/mlock.c                         |  2 +-
 mm/mmap.c                          | 38 ++++++++++++---------
 mm/mprotect.c                      |  2 +-
 13 files changed, 173 insertions(+), 33 deletions(-)

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 533c79e8d2cd..41a9cea73b8b 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -429,6 +429,8 @@ is not associated with a file:
 [stack]                   the stack of the main process
 [vdso]                    the "virtual dynamic shared object",
                           the kernel system call handler
+[anon:<name>]             an anonymous mapping that has been
+                          named by userspace
 =======                   ====================================
 
 or if empty, the mapping is anonymous.
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 5066b0251ed8..290ce5ecad0d 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -97,6 +97,21 @@ unsigned long task_statm(struct mm_struct *mm,
 	return mm->total_vm;
 }
 
+static void seq_print_vma_name(struct seq_file *m, struct vm_area_struct *vma)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	char anon_name[NAME_MAX + 1];
+	int n;
+
+	n = access_remote_vm_locked(mm, (unsigned long)vma_anon_name(vma), anon_name,
+				    NAME_MAX, 0);
+	if (n > 0) {
+		seq_puts(m, "[anon:");
+		seq_write(m, anon_name, strnlen(anon_name, n));
+		seq_putc(m, ']');
+	}
+}
+
 #ifdef CONFIG_NUMA
 /*
  * Save get_task_policy() for show_numa_map().
@@ -319,8 +334,15 @@ show_map_vma(struct seq_file *m, struct vm_area_struct *vma)
 			goto done;
 		}
 
-		if (is_stack(vma))
+		if (is_stack(vma)) {
 			name = "[stack]";
+			goto done;
+		}
+
+		if (vma_anon_name(vma)) {
+			seq_pad(m, ' ');
+			seq_print_vma_name(m, vma);
+		}
 	}
 
 done:
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 0e4a3837da52..0741fc9c57c6 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -874,7 +874,7 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
 				 new_flags, vma->anon_vma,
 				 vma->vm_file, vma->vm_pgoff,
 				 vma_policy(vma),
-				 NULL_VM_UFFD_CTX);
+				 NULL_VM_UFFD_CTX, vma_anon_name(vma));
 		if (prev)
 			vma = prev;
 		else
@@ -1425,7 +1425,8 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 		prev = vma_merge(mm, prev, start, vma_end, new_flags,
 				 vma->anon_vma, vma->vm_file, vma->vm_pgoff,
 				 vma_policy(vma),
-				 ((struct vm_userfaultfd_ctx){ ctx }));
+				 ((struct vm_userfaultfd_ctx){ ctx }),
+				 vma_anon_name(vma));
 		if (prev) {
 			vma = prev;
 			goto next;
@@ -1597,7 +1598,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
 		prev = vma_merge(mm, prev, start, vma_end, new_flags,
 				 vma->anon_vma, vma->vm_file, vma->vm_pgoff,
 				 vma_policy(vma),
-				 NULL_VM_UFFD_CTX);
+				 NULL_VM_UFFD_CTX, vma_anon_name(vma));
 		if (prev) {
 			vma = prev;
 			goto next;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e9212c0bb5ac..430bfd449015 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2491,7 +2491,7 @@ static inline int vma_adjust(struct vm_area_struct *vma, unsigned long start,
 extern struct vm_area_struct *vma_merge(struct mm_struct *,
 	struct vm_area_struct *prev, unsigned long addr, unsigned long end,
 	unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
-	struct mempolicy *, struct vm_userfaultfd_ctx);
+	struct mempolicy *, struct vm_userfaultfd_ctx, const char __user *);
 extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
 extern int __split_vma(struct mm_struct *, struct vm_area_struct *,
 	unsigned long addr, int new_below);
@@ -3130,5 +3130,15 @@ unsigned long wp_shared_mapping_range(struct address_space *mapping,
 
 extern int sysctl_nr_trim_pages;
 
+#ifdef CONFIG_ADVISE_SYSCALLS
+int madvise_set_anon_name(unsigned long start, unsigned long len_in,
+			  unsigned long name_addr);
+#else
+static inline int madvise_set_anon_name(unsigned long start, unsigned long len_in,
+					unsigned long name_addr) {
+	return 0;
+}
+#endif
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 496c3ff97cce..f7d54ae487e6 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -336,11 +336,19 @@ struct vm_area_struct {
 	/*
 	 * For areas with an address space and backing store,
 	 * linkage into the address_space->i_mmap interval tree.
+	 *
+	 * For private anonymous mappings, a pointer to a null terminated string
+	 * in the user process containing the name given to the vma, or NULL
+	 * if unnamed.
	 */
-	struct {
-		struct rb_node rb;
-		unsigned long rb_subtree_last;
-	} shared;
+
+	union {
+		struct {
+			struct rb_node rb;
+			unsigned long rb_subtree_last;
+		} shared;
+		const char __user *anon_name;
+	};
 
 	/*
 	 * A file's MAP_PRIVATE vma can be in both i_mmap tree and anon_vma
@@ -772,4 +780,13 @@ typedef struct {
 	unsigned long val;
 } swp_entry_t;
 
+/* Return the name for an anonymous mapping or NULL for a file-backed mapping */
+static inline const char __user *vma_anon_name(struct vm_area_struct *vma)
+{
+	if (vma->vm_file)
+		return NULL;
+
+	return vma->anon_name;
+}
+
 #endif /* _LINUX_MM_TYPES_H */
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 07b4f8131e36..10773270f67b 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -238,4 +238,7 @@ struct prctl_mm_map {
 #define PR_SET_IO_FLUSHER		57
 #define PR_GET_IO_FLUSHER		58
 
+#define PR_SET_VMA		0x53564d41
+# define PR_SET_VMA_ANON_NAME		0
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index ab6c409b1159..1af718c1d812 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2280,6 +2280,35 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which,
 
 #define PR_IO_FLUSHER (PF_MEMALLOC_NOIO | PF_LOCAL_THROTTLE)
 
+#ifdef CONFIG_MMU
+static int prctl_set_vma(unsigned long opt, unsigned long addr,
+			 unsigned long len, unsigned long arg)
+{
+	struct mm_struct *mm = current->mm;
+	int error;
+
+	mmap_write_lock(mm);
+
+	switch (opt) {
+	case PR_SET_VMA_ANON_NAME:
+		error = madvise_set_anon_name(addr, len, arg);
+		break;
+	default:
+		error = -EINVAL;
+	}
+
+	mmap_write_unlock(mm);
+
+	return error;
+}
+#else /* CONFIG_MMU */
+static int prctl_set_vma(unsigned long opt, unsigned long start,
+			 unsigned long len_in, unsigned long arg)
+{
+	return -EINVAL;
+}
+#endif
+
 SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		unsigned long, arg4, unsigned long, arg5)
 {
@@ -2530,6 +2559,9 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		error = (current->flags & PR_IO_FLUSHER) == PR_IO_FLUSHER;
 		break;
+	case PR_SET_VMA:
+		error = prctl_set_vma(arg2, arg3, arg4, arg5);
+		break;
 	default:
 		error = -EINVAL;
 		break;
diff --git a/mm/interval_tree.c b/mm/interval_tree.c
index 11c75fb07584..dd522625d6f6 100644
--- a/mm/interval_tree.c
+++ b/mm/interval_tree.c
@@ -45,7 +45,7 @@ void vma_interval_tree_insert_after(struct vm_area_struct *node,
 	parent->shared.rb_subtree_last = last;
 	while (parent->shared.rb.rb_left) {
 		parent = rb_entry(parent->shared.rb.rb_left,
-			struct vm_area_struct, shared.rb);
+				  struct vm_area_struct, shared.rb);
 		if (parent->shared.rb_subtree_last < last)
 			parent->shared.rb_subtree_last = last;
 	}
diff --git a/mm/madvise.c b/mm/madvise.c
index 84482c21b029..30c7326e7863 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -65,13 +65,14 @@ static int madvise_need_mmap_write(int behavior)
  */
 static int madvise_update_vma(struct vm_area_struct *vma,
 			      struct vm_area_struct **prev, unsigned long start,
-			      unsigned long end, unsigned long new_flags)
+			      unsigned long end, unsigned long new_flags,
+			      const char __user *new_anon_name)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	int error;
 	pgoff_t pgoff;
 
-	if (new_flags == vma->vm_flags) {
+	if (new_flags == vma->vm_flags && new_anon_name == vma_anon_name(vma)) {
 		*prev = vma;
 		return 0;
 	}
@@ -79,7 +80,7 @@ static int madvise_update_vma(struct vm_area_struct *vma,
 	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
 	*prev = vma_merge(mm, *prev, start, end, new_flags, vma->anon_vma,
			  vma->vm_file, pgoff, vma_policy(vma),
-			  vma->vm_userfaultfd_ctx);
+			  vma->vm_userfaultfd_ctx, new_anon_name);
 	if (*prev) {
 		vma = *prev;
 		goto success;
@@ -112,10 +113,30 @@ static int madvise_update_vma(struct vm_area_struct *vma,
 	 * vm_flags is protected by the mmap_lock held in write mode.
 	 */
 	vma->vm_flags = new_flags;
+	if (!vma->vm_file)
+		vma->anon_name = new_anon_name;
 
 	return 0;
 }
 
+static int madvise_vma_anon_name(struct vm_area_struct *vma,
+				 struct vm_area_struct **prev,
+				 unsigned long start, unsigned long end,
+				 unsigned long name_addr)
+{
+	int error;
+
+	/* Only anonymous mappings can be named */
+	if (vma->vm_file)
+		return -EINVAL;
+
+	error = madvise_update_vma(vma, prev, start, end, vma->vm_flags,
+				   (const char __user *)name_addr);
+	if (error == -ENOMEM)
+		error = -EAGAIN;
+	return error;
+}
+
 #ifdef CONFIG_SWAP
 static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
 	unsigned long end, struct mm_walk *walk)
@@ -877,7 +898,8 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
 		break;
 	}
 
-	error = madvise_update_vma(vma, prev, start, end, new_flags);
+	error = madvise_update_vma(vma, prev, start, end, new_flags,
+				   vma_anon_name(vma));
 
 out:
 	if (error == -ENOMEM)
@@ -1059,6 +1081,30 @@ int madvise_walk_vmas(unsigned long start, unsigned long end,
 	return unmapped_error;
 }
 
+int madvise_set_anon_name(unsigned long start, unsigned long len_in,
+			  unsigned long name_addr)
+{
+	unsigned long end;
+	unsigned long len;
+
+	if (start & ~PAGE_MASK)
+		return -EINVAL;
+	len = (len_in + ~PAGE_MASK) & PAGE_MASK;
+
+	/* Check to see whether len was rounded up from small -ve to zero */
+	if (len_in && !len)
+		return -EINVAL;
+
+	end = start + len;
+	if (end < start)
+		return -EINVAL;
+
+	if (end == start)
+		return 0;
+
+	return madvise_walk_vmas(start, end, name_addr, madvise_vma_anon_name);
+}
+
 /*
  * The madvise(2) system call.
 *
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index eddbe4e56c73..94338d9bfe57 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -829,7 +829,8 @@ static int mbind_range(struct mm_struct *mm, unsigned long start,
 			((vmstart - vma->vm_start) >> PAGE_SHIFT);
 		prev = vma_merge(mm, prev, vmstart, vmend, vma->vm_flags,
 				 vma->anon_vma, vma->vm_file, pgoff,
-				 new_pol, vma->vm_userfaultfd_ctx);
+				 new_pol, vma->vm_userfaultfd_ctx,
+				 vma_anon_name(vma));
 		if (prev) {
 			vma = prev;
 			next = vma->vm_next;
diff --git a/mm/mlock.c b/mm/mlock.c
index 93ca2bf30b4f..8e0046c4642f 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -534,7 +534,7 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
 	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
 	*prev = vma_merge(mm, *prev, start, end, newflags, vma->anon_vma,
 			  vma->vm_file, pgoff, vma_policy(vma),
-			  vma->vm_userfaultfd_ctx);
+			  vma->vm_userfaultfd_ctx, vma_anon_name(vma));
 	if (*prev) {
 		vma = *prev;
 		goto success;
diff --git a/mm/mmap.c b/mm/mmap.c
index 40248d84ad5f..8f3cd352a48f 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -987,7 +987,8 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
  */
 static inline int is_mergeable_vma(struct vm_area_struct *vma,
 				   struct file *file, unsigned long vm_flags,
-				   struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
+				   struct vm_userfaultfd_ctx vm_userfaultfd_ctx,
+				   const char __user *anon_name)
 {
 	/*
 	 * VM_SOFTDIRTY should not prevent from VMA merging, if we
@@ -1005,6 +1006,8 @@ static inline int is_mergeable_vma(struct vm_area_struct *vma,
 		return 0;
 	if (!is_mergeable_vm_userfaultfd_ctx(vma, vm_userfaultfd_ctx))
 		return 0;
+	if (vma_anon_name(vma) != anon_name)
+		return 0;
 	return 1;
 }
 
@@ -1037,9 +1040,10 @@ static int
 can_vma_merge_before(struct vm_area_struct *vma, unsigned long vm_flags,
 		     struct anon_vma *anon_vma, struct file *file,
 		     pgoff_t vm_pgoff,
-		     struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
+		     struct vm_userfaultfd_ctx vm_userfaultfd_ctx,
+		     const char __user *anon_name)
 {
-	if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx) &&
+	if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx, anon_name) &&
 	    is_mergeable_anon_vma(anon_vma, vma->anon_vma, vma)) {
 		if (vma->vm_pgoff == vm_pgoff)
 			return 1;
@@ -1058,9 +1062,10 @@ static int
 can_vma_merge_after(struct vm_area_struct *vma, unsigned long vm_flags,
 		    struct anon_vma *anon_vma, struct file *file,
 		    pgoff_t vm_pgoff,
-		    struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
+		    struct vm_userfaultfd_ctx vm_userfaultfd_ctx,
+		    const char __user *anon_name)
 {
-	if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx) &&
+	if (is_mergeable_vma(vma, file, vm_flags, vm_userfaultfd_ctx, anon_name) &&
 	    is_mergeable_anon_vma(anon_vma, vma->anon_vma, vma)) {
 		pgoff_t vm_pglen;
 		vm_pglen = vma_pages(vma);
@@ -1071,9 +1076,9 @@ can_vma_merge_after(struct vm_area_struct *vma, unsigned long vm_flags,
 }
 
 /*
- * Given a mapping request (addr,end,vm_flags,file,pgoff), figure out
- * whether that can be merged with its predecessor or its successor.
- * Or both (it neatly fills a hole).
+ * Given a mapping request (addr,end,vm_flags,file,pgoff,anon_name),
+ * figure out whether that can be merged with its predecessor or its
+ * successor. Or both (it neatly fills a hole).
 *
 * In most cases - when called for mmap, brk or mremap - [addr,end) is
 * certain not to be mapped by the time vma_merge is called; but when
@@ -1118,7 +1123,8 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
 			unsigned long end, unsigned long vm_flags,
 			struct anon_vma *anon_vma, struct file *file,
 			pgoff_t pgoff, struct mempolicy *policy,
-			struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
+			struct vm_userfaultfd_ctx vm_userfaultfd_ctx,
+			const char __user *anon_name)
 {
 	pgoff_t pglen = (end - addr) >> PAGE_SHIFT;
 	struct vm_area_struct *area, *next;
@@ -1151,7 +1157,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
 			mpol_equal(vma_policy(prev), policy) &&
 			can_vma_merge_after(prev, vm_flags,
 					    anon_vma, file, pgoff,
-					    vm_userfaultfd_ctx)) {
+					    vm_userfaultfd_ctx, anon_name)) {
 		/*
 		 * OK, it can.  Can we now merge in the successor as well?
 		 */
@@ -1160,7 +1166,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
 				can_vma_merge_before(next, vm_flags,
 						     anon_vma, file,
 						     pgoff+pglen,
-						     vm_userfaultfd_ctx) &&
+						     vm_userfaultfd_ctx, anon_name) &&
 				is_mergeable_anon_vma(prev->anon_vma,
 						      next->anon_vma, NULL)) {
 							/* cases 1, 6 */
@@ -1183,7 +1189,7 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
 			mpol_equal(policy, vma_policy(next)) &&
 			can_vma_merge_before(next, vm_flags,
 					     anon_vma, file, pgoff+pglen,
-					     vm_userfaultfd_ctx)) {
+					     vm_userfaultfd_ctx, anon_name)) {
 		if (prev && addr < prev->vm_end)	/* case 4 */
 			err = __vma_adjust(prev, prev->vm_start,
 					 addr, prev->vm_pgoff, NULL, next);
@@ -1731,7 +1737,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	 * Can we just expand an old mapping?
 	 */
 	vma = vma_merge(mm, prev, addr, addr + len, vm_flags,
-			NULL, file, pgoff, NULL, NULL_VM_UFFD_CTX);
+			NULL, file, pgoff, NULL, NULL_VM_UFFD_CTX, NULL);
 	if (vma)
 		goto out;
 
@@ -1779,7 +1785,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	 */
 	if (unlikely(vm_flags != vma->vm_flags && prev)) {
 		merge = vma_merge(mm, prev, vma->vm_start, vma->vm_end, vma->vm_flags,
-			NULL, vma->vm_file, vma->vm_pgoff, NULL, NULL_VM_UFFD_CTX);
+			NULL, vma->vm_file, vma->vm_pgoff, NULL, NULL_VM_UFFD_CTX, NULL);
 		if (merge) {
 			fput(file);
 			vm_area_free(vma);
@@ -3063,7 +3069,7 @@ static int do_brk_flags(unsigned long addr, unsigned long len, unsigned long fla
 	/* Can we just expand an old private anonymous mapping? */
 	vma = vma_merge(mm, prev, addr, addr + len, flags,
-			NULL, NULL, pgoff, NULL, NULL_VM_UFFD_CTX);
+			NULL, NULL, pgoff, NULL, NULL_VM_UFFD_CTX, NULL);
 	if (vma)
 		goto out;
 
@@ -3262,7 +3268,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 		return NULL;	/* should never get here */
 	new_vma = vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
 			    vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
-			    vma->vm_userfaultfd_ctx);
+			    vma->vm_userfaultfd_ctx, vma_anon_name(vma));
 	if (new_vma) {
 		/*
 		 * Source vma may have been merged into new_vma
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ce8b8a5eacbb..d90c349a3fd9 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -454,7 +454,7 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
 	pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
 	*pprev = vma_merge(mm, *pprev, start, end, newflags, vma->anon_vma,
 			   vma->vm_file, pgoff, vma_policy(vma),
-			   vma->vm_userfaultfd_ctx);
+			   vma->vm_userfaultfd_ctx, vma_anon_name(vma));
 	if (*pprev) {
 		vma = *pprev;
 		VM_WARN_ON((vma->vm_flags ^ newflags) & ~VM_SOFTDIRTY);
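For completeness, a minimal userspace demonstration of the interface
this series adds. This is a sketch, not part of the series: the
PR_SET_VMA constants are copied from the uapi header added above in
case the system's libc headers don't define them yet, and it only does
something useful on a kernel with these patches applied.

/* anon_name_demo.c (hypothetical file name): name an anonymous mapping
 * and show it in /proc/self/maps. Build with: cc -o demo anon_name_demo.c */
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/prctl.h>
#include <unistd.h>

#ifndef PR_SET_VMA
#define PR_SET_VMA		0x53564d41
#define PR_SET_VMA_ANON_NAME	0
#endif

int main(void)
{
	size_t len = 4 * getpagesize();
	char cmd[64];

	/* Only private anonymous mappings can be named. */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;

	/* The kernel stores the user pointer itself, so the string must
	 * remain valid and readable for as long as the name should be
	 * visible in /proc/pid/maps. */
	static const char name[] = "demo heap";
	if (prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME,
		  (unsigned long)p, len, (unsigned long)name))
		perror("prctl(PR_SET_VMA)");

	/* The region should now show up as [anon:demo heap]. */
	snprintf(cmd, sizeof(cmd), "grep anon: /proc/%d/maps", getpid());
	(void)system(cmd);

	/* Passing NULL clears the name again. */
	prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME,
	      (unsigned long)p, len, (unsigned long)NULL);

	munmap(p, len);
	return 0;
}

On a patched kernel the grep should print a line ending in
[anon:demo heap]; on an unpatched kernel the prctl() fails with EINVAL
and the mapping stays unnamed.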