From patchwork Mon Jun 22 19:28:57 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Minchan Kim
X-Patchwork-Id: 11618789
From: Minchan Kim
To: Andrew Morton
Cc: LKML, Christian Brauner, linux-mm, linux-api@vger.kernel.org,
 oleksandr@redhat.com, Suren Baghdasaryan, Tim Murray, Sandeep Patil,
 Sonny Rao, Brian Geffon, Michal Hocko, Johannes Weiner, Shakeel Butt,
 John Dias, Joel Fernandes, Jann Horn, alexander.h.duyck@linux.intel.com,
 sj38.park@gmail.com, David Rientjes, Arjun Roy, Minchan Kim,
 Vlastimil Babka, Jens Axboe, Daniel Colascione, Kirill Tkhai,
 SeongJae Park, linux-man@vger.kernel.org
Subject: [PATCH v8 1/4] mm/madvise: pass task and mm to do_madvise
Date: Mon, 22 Jun 2020 12:28:57 -0700
Message-Id: <20200622192900.22757-2-minchan@kernel.org>
X-Mailer: git-send-email 2.27.0.111.gc72c7da667-goog
In-Reply-To: <20200622192900.22757-1-minchan@kernel.org>
References: <20200622192900.22757-1-minchan@kernel.org>
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org

Patch series "introduce memory hinting API for external process", v8.

We now have MADV_PAGEOUT and MADV_COLD as madvise hints. With them, an
application can tell the kernel which memory ranges it prefers to have
reclaimed. However, on some platforms (e.g., Android), the information
required to make the hinting decision is not known to the app. Instead,
it is known to a centralized userspace daemon (e.g.,
ActivityManagerService), and that daemon must be able to initiate
reclaim on its own without any app involvement.

To address this, this patch series introduces a new syscall,
process_madvise(2). It is basically the same as madvise(2), with these
differences:

1. It needs a pidfd of the target process to provide the hint.

2. It supports only MADV_{COLD|PAGEOUT|MERGEABLE|UNMERGEABLE} for now.
   Other madvise hints will be enabled when there are explicit requests
   from the community, to prevent unexpected bugs we couldn't support.

3. Only privileged processes can act on another process's address
   space.

For more detail on the new API, see the description of "mm: introduce
external memory hinting API" in this patchset.

This patch (of 4):

In upcoming patches, do_madvise will be called from an external process
context, so we shouldn't assume "current" is always the hinted
process's task_struct.

Furthermore, we must not access mm_struct via task->mm, but obtain it
via mm_access() once (in the following patch) and only use that
pointer [1], so pass it to do_madvise() as well.
Note the vma->vm_mm pointers are safe, so we can use them further down
the call stack.

Pass *current* and current->mm as the arguments of do_madvise, so that
existing behavior is unchanged while this prepares for the next patch
and keeps it easy to review.

Note: io_madvise passes NULL as the target_task argument of do_madvise
because it cannot know who the target is.

[1] http://lore.kernel.org/r/CAG48ez27=pwm5m_N_988xT1huO7g7h6arTQL44zev6TD-h-7Tg@mail.gmail.com

[vbabka@suse.cz: changelog tweak]
[minchan@kernel.org: use current->mm for io_uring]
  Link: http://lkml.kernel.org/r/20200423145215.72666-1-minchan@kernel.org
[akpm@linux-foundation.org: fix it for upstream changes]
[akpm@linux-foundation.org: whoops]
[rdunlap@infradead.org: add missing includes]
Link: http://lkml.kernel.org/r/20200302193630.68771-2-minchan@kernel.org
Signed-off-by: Minchan Kim
Reviewed-by: Suren Baghdasaryan
Reviewed-by: Vlastimil Babka
Cc: Jens Axboe
Cc: Jann Horn
Cc: Tim Murray
Cc: Daniel Colascione
Cc: Sandeep Patil
Cc: Sonny Rao
Cc: Brian Geffon
Cc: Michal Hocko
Cc: Johannes Weiner
Cc: Shakeel Butt
Cc: John Dias
Cc: Joel Fernandes
Cc: Alexander Duyck
Cc: SeongJae Park
Cc: Christian Brauner
Cc: Kirill Tkhai
Cc: Oleksandr Natalenko
Acked-by: David Rientjes
---
 fs/io_uring.c      |  2 +-
 include/linux/mm.h |  3 ++-
 mm/madvise.c       | 40 +++++++++++++++++++++++-----------------
 3 files changed, 26 insertions(+), 19 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 59c8871464b6..063946e17a59 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -3586,7 +3586,7 @@ static int io_madvise(struct io_kiocb *req, bool force_nonblock)
 	if (force_nonblock)
 		return -EAGAIN;
 
-	ret = do_madvise(ma->addr, ma->len, ma->advice);
+	ret = do_madvise(NULL, current->mm, ma->addr, ma->len, ma->advice);
 	if (ret < 0)
 		req_set_fail_links(req);
 	io_cqring_add_event(req, ret);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e6ff54a7b284..30729f675b98 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2541,7 +2541,8 @@ extern int __do_munmap(struct mm_struct *, unsigned long, size_t,
 		       struct list_head *uf, bool downgrade);
 extern int do_munmap(struct mm_struct *, unsigned long, size_t,
 		     struct list_head *uf);
-extern int do_madvise(unsigned long start, size_t len_in, int behavior);
+extern int do_madvise(struct task_struct *target_task, struct mm_struct *mm,
+		      unsigned long start, size_t len_in, int behavior);
 
 static inline unsigned long
 do_mmap_pgoff(struct file *file, unsigned long addr,
diff --git a/mm/madvise.c b/mm/madvise.c
index dd1d43cf026d..551ed816eefe 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -22,12 +22,14 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
 #include
 #include
 #include
+#include
 
 #include
 
@@ -255,6 +257,7 @@ static long madvise_willneed(struct vm_area_struct *vma,
 			     struct vm_area_struct **prev,
 			     unsigned long start, unsigned long end)
 {
+	struct mm_struct *mm = vma->vm_mm;
 	struct file *file = vma->vm_file;
 	loff_t offset;
 
@@ -289,12 +292,12 @@ static long madvise_willneed(struct vm_area_struct *vma,
 	 */
 	*prev = NULL;	/* tell sys_madvise we drop mmap_lock */
 	get_file(file);
-	mmap_read_unlock(current->mm);
+	mmap_read_unlock(mm);
 	offset = (loff_t)(start - vma->vm_start)
 			+ ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
 	vfs_fadvise(file, offset, end - start, POSIX_FADV_WILLNEED);
 	fput(file);
-	mmap_read_lock(current->mm);
+	mmap_read_lock(mm);
 	return 0;
 }
 
@@ -683,7 +686,6 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	if (nr_swap) {
 		if (current->mm == mm)
 			sync_mm_rss(mm);
-
 		add_mm_counter(mm, MM_SWAPENTS, nr_swap);
 	}
 	arch_leave_lazy_mmu_mode();
@@ -763,6 +765,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
 				  unsigned long start, unsigned long end,
 				  int behavior)
 {
+	struct mm_struct *mm = vma->vm_mm;
+
 	*prev = vma;
 	if (!can_madv_lru_vma(vma))
 		return -EINVAL;
@@ -770,8 +774,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
 	if (!userfaultfd_remove(vma, start, end)) {
 		*prev = NULL; /* mmap_lock has been dropped, prev is stale */
 
-		mmap_read_lock(current->mm);
-		vma = find_vma(current->mm, start);
+		mmap_read_lock(mm);
+		vma = find_vma(mm, start);
 		if (!vma)
 			return -ENOMEM;
 		if (start < vma->vm_start) {
@@ -825,6 +829,7 @@ static long madvise_remove(struct vm_area_struct *vma,
 	loff_t offset;
 	int error;
 	struct file *f;
+	struct mm_struct *mm = vma->vm_mm;
 
 	*prev = NULL;	/* tell sys_madvise we drop mmap_lock */
 
@@ -852,13 +857,13 @@ static long madvise_remove(struct vm_area_struct *vma,
 	get_file(f);
 	if (userfaultfd_remove(vma, start, end)) {
 		/* mmap_lock was not released by userfaultfd_remove() */
-		mmap_read_unlock(current->mm);
+		mmap_read_unlock(mm);
 	}
 	error = vfs_fallocate(f,
 				FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
 				offset, end - start);
 	fput(f);
-	mmap_read_lock(current->mm);
+	mmap_read_lock(mm);
 	return error;
 }
 
@@ -1051,7 +1056,8 @@ madvise_behavior_valid(int behavior)
  *  -EBADF  - map exists, but area maps something that isn't a file.
  *  -EAGAIN - a kernel resource was temporarily unavailable.
  */
-int do_madvise(unsigned long start, size_t len_in, int behavior)
+int do_madvise(struct task_struct *target_task, struct mm_struct *mm,
+		unsigned long start, size_t len_in, int behavior)
 {
 	unsigned long end, tmp;
 	struct vm_area_struct *vma, *prev;
@@ -1089,7 +1095,7 @@ int do_madvise(unsigned long start, size_t len_in, int behavior)
 
 	write = madvise_need_mmap_write(behavior);
 	if (write) {
-		if (mmap_write_lock_killable(current->mm))
+		if (mmap_write_lock_killable(mm))
 			return -EINTR;
 
 		/*
@@ -1104,12 +1110,12 @@ int do_madvise(unsigned long start, size_t len_in, int behavior)
 		 * but for now we have the mmget_still_valid()
 		 * model.
 		 */
-		if (!mmget_still_valid(current->mm)) {
-			mmap_write_unlock(current->mm);
+		if (!mmget_still_valid(mm)) {
+			mmap_write_unlock(mm);
 			return -EINTR;
 		}
 	} else {
-		mmap_read_lock(current->mm);
+		mmap_read_lock(mm);
 	}
 
 	/*
@@ -1117,7 +1123,7 @@ int do_madvise(unsigned long start, size_t len_in, int behavior)
 	 * ranges, just ignore them, but return -ENOMEM at the end.
 	 * - different from the way of handling in mlock etc.
 	 */
-	vma = find_vma_prev(current->mm, start, &prev);
+	vma = find_vma_prev(mm, start, &prev);
 	if (vma && start > vma->vm_start)
 		prev = vma;
 
@@ -1154,19 +1160,19 @@ int do_madvise(unsigned long start, size_t len_in, int behavior)
 		if (prev)
 			vma = prev->vm_next;
 		else	/* madvise_remove dropped mmap_lock */
-			vma = find_vma(current->mm, start);
+			vma = find_vma(mm, start);
 	}
 out:
 	blk_finish_plug(&plug);
 	if (write)
-		mmap_write_unlock(current->mm);
+		mmap_write_unlock(mm);
 	else
-		mmap_read_unlock(current->mm);
+		mmap_read_unlock(mm);
 
 	return error;
 }
 
 SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 {
-	return do_madvise(start, len_in, behavior);
+	return do_madvise(current, current->mm, start, len_in, behavior);
 }
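
Illustration only, not part of the patch: the refactor above replaces an
implicit "current->mm" with an mm the caller passes in. The userspace
sketch below models that idea with toy stand-ins. Every toy_* name,
current_task, and the read_locks counter are hypothetical; the struct
definitions are not the kernel's, and no kernel locking is performed.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the kernel types; NOT the real definitions. */
struct mm_struct {
	int read_locks;		/* times this address space was "locked" */
};

struct task_struct {
	struct mm_struct *mm;
};

/* Hypothetical stand-in for the kernel's "current" task pointer. */
static struct task_struct *current_task;

/* Records the lock on the mm it is given; takes no real lock. */
static void toy_mmap_read_lock(struct mm_struct *mm)
{
	assert(mm != NULL);
	mm->read_locks++;
}

/* Old shape: the address space is implied by the calling context. */
static int toy_madvise_implicit(void)
{
	toy_mmap_read_lock(current_task->mm);
	return 0;
}

/*
 * New shape: the caller names the mm explicitly. target_task may be
 * NULL, as in the io_madvise call site of the patch.
 */
static int toy_madvise_explicit(struct task_struct *target_task,
				struct mm_struct *mm)
{
	(void)target_task;	/* unused in this sketch */
	toy_mmap_read_lock(mm);
	return 0;
}

/* Returns 1 if hinting a foreign mm leaves the caller's mm untouched. */
static int toy_demo(void)
{
	struct mm_struct caller_mm = { 0 }, target_mm = { 0 };
	struct task_struct caller = { &caller_mm };
	struct task_struct target = { &target_mm };

	current_task = &caller;
	toy_madvise_explicit(&target, &target_mm);
	return caller_mm.read_locks == 0 && target_mm.read_locks == 1;
}
```

With the explicit form, a caller acting on behalf of another process
touches only the mm it was handed, while passing the caller's own task
and mm reproduces the old behavior, which is what SYSCALL_DEFINE3(madvise)
does in the diff above.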