From patchwork Fri Jul 26 15:23:18 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Joel Fernandes
X-Patchwork-Id: 11061245
From: "Joel Fernandes (Google)"
To: linux-kernel@vger.kernel.org
Cc: "Joel Fernandes (Google)" , Alexey Dobriyan ,
Andrew Morton , Brendan Gregg , Christian Hansen , dancol@google.com, fmayer@google.com, joaodias@google.com, joelaf@google.com, Jonathan Corbet , Kees Cook , kernel-team@android.com, linux-api@vger.kernel.org, linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Michal Hocko , Mike Rapoport , minchan@kernel.org, namhyung@google.com, Roman Gushchin , Stephen Rothwell , surenb@google.com, tkjos@google.com, Vladimir Davydov , Vlastimil Babka , wvw@google.com
Subject: [PATCH v3 1/2] mm/page_idle: Add per-pid idle page tracking using virtual indexing
Date: Fri, 26 Jul 2019 11:23:18 -0400
Message-Id: <20190726152319.134152-1-joel@joelfernandes.org>
X-Mailer: git-send-email 2.22.0.709.g102302147b-goog

The page_idle tracking feature currently requires looking up the pagemap
for a process and then interacting with /sys/kernel/mm/page_idle. On
Android devices, looking up the PFN from pagemap is not supported for
unprivileged processes: it requires SYS_ADMIN, and without it pagemap
gives 0 for the PFN.

This patch adds support for interacting with page_idle tracking directly
at the PID level by introducing a /proc/<pid>/page_idle file. It follows
exactly the same semantics as the global /sys/kernel/mm/page_idle, but
looking up the PFN through pagemap is no longer needed because the
interface uses virtual frame numbers, and it does not require SYS_ADMIN.

In Android, we are using this for the heap profiler (heapprofd), which
profiles and pinpoints code paths that allocate memory and leave it idle
for long periods of time.

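For illustration only (a userspace sketch, not part of this patch): assuming
the per-process file uses the same layout as the global bitmap, that is, one
bit per virtual frame packed into 8-byte chunks with chunk-aligned file
offsets, the position of a virtual address in the file could be computed as
below. The helper names idle_chunk_offset(), idle_chunk_bit() and
idle_count() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One bitmap chunk is a u64, mirroring BITMAP_CHUNK_SIZE in mm/page_idle.c. */
#define CHUNK_BYTES ((uint64_t)sizeof(uint64_t))
#define CHUNK_BITS  (CHUNK_BYTES * 8)

/* Hypothetical helper: file offset of the chunk that covers vaddr. */
static uint64_t idle_chunk_offset(uint64_t vaddr, unsigned int page_shift)
{
	assert(page_shift < 64);
	/* virtual frame number -> chunk index -> byte offset */
	return ((vaddr >> page_shift) / CHUNK_BITS) * CHUNK_BYTES;
}

/* Hypothetical helper: bit position of vaddr's frame inside its chunk. */
static unsigned int idle_chunk_bit(uint64_t vaddr, unsigned int page_shift)
{
	assert(page_shift < 64);
	return (unsigned int)((vaddr >> page_shift) % CHUNK_BITS);
}

/* Hypothetical helper: number of set (idle) bits in chunks read back. */
static size_t idle_count(const uint64_t *chunks, size_t nchunks)
{
	size_t idle = 0;

	for (size_t i = 0; i < nchunks; i++)
		for (uint64_t c = chunks[i]; c != 0; c &= c - 1)
			idle++;	/* each iteration clears the lowest set bit */
	return idle;
}
```

A profiler would read the 8-byte chunk at idle_chunk_offset() from the
page_idle file of the target process and test bit idle_chunk_bit() in it;
idle_count() tallies the idle pages in a buffer of chunks that was read back.
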
This method solves the security issue of userspace learning the PFN. It
has also been shown to yield better results than the pagemap lookup; the
theory is that eliminating the intermediate pagemap lookup stage reduces
the window in which the address space can change. In virtual address
indexing, the process's mmap_sem is held for the duration of the access.

Signed-off-by: Joel Fernandes (Google)
---
v2->v3:
Fixed a bug where I was doing a kfree that is not needed due to not
needing to do GFP_ATOMIC allocations.

v1->v2:
Mark swap ptes as idle (Minchan)
Avoid need for GFP_ATOMIC (Andrew)
Get rid of idle_page_list lock by moving list to stack

Internal review -> v1:
Fixes from Suren.
Corrections to change log, docs (Florian, Sandeep)

 fs/proc/base.c            |   3 +
 fs/proc/internal.h        |   1 +
 fs/proc/task_mmu.c        |  57 +++++++
 include/linux/page_idle.h |   4 +
 mm/page_idle.c            | 340 +++++++++++++++++++++++++++++++++-----
 5 files changed, 360 insertions(+), 45 deletions(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 77eb628ecc7f..a58dd74606e9 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -3021,6 +3021,9 @@ static const struct pid_entry tgid_base_stuff[] = {
 	REG("smaps", S_IRUGO, proc_pid_smaps_operations),
 	REG("smaps_rollup", S_IRUGO, proc_pid_smaps_rollup_operations),
 	REG("pagemap", S_IRUSR, proc_pagemap_operations),
+#ifdef CONFIG_IDLE_PAGE_TRACKING
+	REG("page_idle", S_IRUSR|S_IWUSR, proc_page_idle_operations),
+#endif
 #endif
 #ifdef CONFIG_SECURITY
 	DIR("attr", S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations),
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index cd0c8d5ce9a1..bc9371880c63 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -293,6 +293,7 @@ extern const struct file_operations proc_pid_smaps_operations;
 extern const struct file_operations proc_pid_smaps_rollup_operations;
 extern const struct file_operations proc_clear_refs_operations;
 extern const struct file_operations proc_pagemap_operations;
+extern const struct file_operations proc_page_idle_operations;
 
 extern unsigned long task_vsize(struct mm_struct *);
 extern unsigned long task_statm(struct mm_struct *,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 4d2b860dbc3f..11ccc53da38e 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1642,6 +1642,63 @@ const struct file_operations proc_pagemap_operations = {
 	.open = pagemap_open,
 	.release = pagemap_release,
 };
+
+#ifdef CONFIG_IDLE_PAGE_TRACKING
+static ssize_t proc_page_idle_read(struct file *file, char __user *buf,
+				   size_t count, loff_t *ppos)
+{
+	int ret;
+	struct task_struct *tsk = get_proc_task(file_inode(file));
+
+	if (!tsk)
+		return -EINVAL;
+	ret = page_idle_proc_read(file, buf, count, ppos, tsk);
+	put_task_struct(tsk);
+	return ret;
+}
+
+static ssize_t proc_page_idle_write(struct file *file, const char __user *buf,
+				    size_t count, loff_t *ppos)
+{
+	int ret;
+	struct task_struct *tsk = get_proc_task(file_inode(file));
+
+	if (!tsk)
+		return -EINVAL;
+	ret = page_idle_proc_write(file, (char __user *)buf, count, ppos, tsk);
+	put_task_struct(tsk);
+	return ret;
+}
+
+static int proc_page_idle_open(struct inode *inode, struct file *file)
+{
+	struct mm_struct *mm;
+
+	mm = proc_mem_open(inode, PTRACE_MODE_READ);
+	if (IS_ERR(mm))
+		return PTR_ERR(mm);
+	file->private_data = mm;
+	return 0;
+}
+
+static int proc_page_idle_release(struct inode *inode, struct file *file)
+{
+	struct mm_struct *mm = file->private_data;
+
+	if (mm)
+		mmdrop(mm);
+	return 0;
+}
+
+const struct file_operations proc_page_idle_operations = {
+	.llseek = mem_lseek, /* borrow this */
+	.read = proc_page_idle_read,
+	.write = proc_page_idle_write,
+	.open = proc_page_idle_open,
+	.release = proc_page_idle_release,
+};
+#endif /* CONFIG_IDLE_PAGE_TRACKING */
+
 #endif /* CONFIG_PROC_PAGE_MONITOR */
 
 #ifdef CONFIG_NUMA
diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h
index 1e894d34bdce..f1bc2640d85e 100644
--- a/include/linux/page_idle.h
+++ b/include/linux/page_idle.h
@@ -106,6 +106,10 @@ static inline void clear_page_idle(struct page *page)
 }
 #endif /* CONFIG_64BIT */
 
+ssize_t page_idle_proc_write(struct file *file,
+	char __user *buf, size_t count, loff_t *ppos, struct task_struct *tsk);
+ssize_t page_idle_proc_read(struct file *file,
+	char __user *buf, size_t count, loff_t *ppos, struct task_struct *tsk);
 #else /* !CONFIG_IDLE_PAGE_TRACKING */
 
 static inline bool page_is_young(struct page *page)
diff --git a/mm/page_idle.c b/mm/page_idle.c
index 295512465065..86244f7f1faa 100644
--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -5,12 +5,15 @@
 #include
 #include
 #include
-#include
-#include
-#include
 #include
+#include
 #include
 #include
+#include
+#include
+#include
+#include
+#include
 
 #define BITMAP_CHUNK_SIZE sizeof(u64)
 #define BITMAP_CHUNK_BITS (BITMAP_CHUNK_SIZE * BITS_PER_BYTE)
@@ -25,18 +28,13 @@
  * page tracking. With such an indicator of user pages we can skip isolated
  * pages, but since there are not usually many of them, it will hardly affect
  * the overall result.
- *
- * This function tries to get a user memory page by pfn as described above.
  */
-static struct page *page_idle_get_page(unsigned long pfn)
+static struct page *page_idle_get_page(struct page *page_in)
 {
 	struct page *page;
 	pg_data_t *pgdat;
 
-	if (!pfn_valid(pfn))
-		return NULL;
-
-	page = pfn_to_page(pfn);
+	page = page_in;
 	if (!page || !PageLRU(page) ||
 	    !get_page_unless_zero(page))
 		return NULL;
@@ -51,6 +49,18 @@ static struct page *page_idle_get_page(unsigned long pfn)
 	return page;
 }
 
+/*
+ * This function tries to get a user memory page by pfn as described above.
+ */
+static struct page *page_idle_get_page_pfn(unsigned long pfn)
+{
+
+	if (!pfn_valid(pfn))
+		return NULL;
+
+	return page_idle_get_page(pfn_to_page(pfn));
+}
+
 static bool page_idle_clear_pte_refs_one(struct page *page,
 					struct vm_area_struct *vma,
 					unsigned long addr, void *arg)
@@ -118,6 +128,47 @@ static void page_idle_clear_pte_refs(struct page *page)
 	unlock_page(page);
 }
 
+/* Helper to get the start and end frame given a pos and count */
+static int page_idle_get_frames(loff_t pos, size_t count, struct mm_struct *mm,
+				unsigned long *start, unsigned long *end)
+{
+	unsigned long max_frame;
+
+	/* If an mm is not given, assume we want physical frames */
+	max_frame = mm ? (mm->task_size >> PAGE_SHIFT) : max_pfn;
+
+	if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
+		return -EINVAL;
+
+	*start = pos * BITS_PER_BYTE;
+	if (*start >= max_frame)
+		return -ENXIO;
+
+	*end = *start + count * BITS_PER_BYTE;
+	if (*end > max_frame)
+		*end = max_frame;
+	return 0;
+}
+
+static bool page_really_idle(struct page *page)
+{
+	if (!page)
+		return false;
+
+	if (page_is_idle(page)) {
+		/*
+		 * The page might have been referenced via a
+		 * pte, in which case it is not idle. Clear
+		 * refs and recheck.
+		 */
+		page_idle_clear_pte_refs(page);
+		if (page_is_idle(page))
+			return true;
+	}
+
+	return false;
+}
+
 static ssize_t page_idle_bitmap_read(struct file *file, struct kobject *kobj,
 				     struct bin_attribute *attr, char *buf,
 				     loff_t pos, size_t count)
@@ -125,35 +176,21 @@ static ssize_t page_idle_bitmap_read(struct file *file, struct kobject *kobj,
 	u64 *out = (u64 *)buf;
 	struct page *page;
 	unsigned long pfn, end_pfn;
-	int bit;
+	int bit, ret;
 
-	if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
-		return -EINVAL;
-
-	pfn = pos * BITS_PER_BYTE;
-	if (pfn >= max_pfn)
-		return 0;
-
-	end_pfn = pfn + count * BITS_PER_BYTE;
-	if (end_pfn > max_pfn)
-		end_pfn = max_pfn;
+	ret = page_idle_get_frames(pos, count, NULL, &pfn, &end_pfn);
+	if (ret == -ENXIO)
+		return 0; /* Reads beyond max_pfn do nothing */
+	else if (ret)
+		return ret;
 
 	for (; pfn < end_pfn; pfn++) {
 		bit = pfn % BITMAP_CHUNK_BITS;
 		if (!bit)
 			*out = 0ULL;
-		page = page_idle_get_page(pfn);
-		if (page) {
-			if (page_is_idle(page)) {
-				/*
-				 * The page might have been referenced via a
-				 * pte, in which case it is not idle. Clear
-				 * refs and recheck.
-				 */
-				page_idle_clear_pte_refs(page);
-				if (page_is_idle(page))
-					*out |= 1ULL << bit;
-			}
+		page = page_idle_get_page_pfn(pfn);
+		if (page && page_really_idle(page)) {
+			*out |= 1ULL << bit;
 			put_page(page);
 		}
 		if (bit == BITMAP_CHUNK_BITS - 1)
@@ -170,23 +207,16 @@ static ssize_t page_idle_bitmap_write(struct file *file, struct kobject *kobj,
 	const u64 *in = (u64 *)buf;
 	struct page *page;
 	unsigned long pfn, end_pfn;
-	int bit;
+	int bit, ret;
 
-	if (pos % BITMAP_CHUNK_SIZE || count % BITMAP_CHUNK_SIZE)
-		return -EINVAL;
-
-	pfn = pos * BITS_PER_BYTE;
-	if (pfn >= max_pfn)
-		return -ENXIO;
-
-	end_pfn = pfn + count * BITS_PER_BYTE;
-	if (end_pfn > max_pfn)
-		end_pfn = max_pfn;
+	ret = page_idle_get_frames(pos, count, NULL, &pfn, &end_pfn);
+	if (ret)
+		return ret;
 
 	for (; pfn < end_pfn; pfn++) {
 		bit = pfn % BITMAP_CHUNK_BITS;
 		if ((*in >> bit) & 1) {
-			page = page_idle_get_page(pfn);
+			page = page_idle_get_page_pfn(pfn);
 			if (page) {
 				page_idle_clear_pte_refs(page);
 				set_page_idle(page);
@@ -224,6 +254,226 @@ struct page_ext_operations page_idle_ops = {
 };
 #endif
 
+/* page_idle tracking for /proc/<pid>/page_idle */
+
+struct page_node {
+	struct page *page;
+	unsigned long addr;
+	struct list_head list;
+};
+
+struct page_idle_proc_priv {
+	unsigned long start_addr;
+	char *buffer;
+	int write;
+
+	/* Pre-allocate and provide nodes to add_page_idle_list() */
+	struct page_node *page_nodes;
+	int cur_page_node;
+	struct list_head *idle_page_list;
+};
+
+/*
+ * Add a page to the idle page list. page can be NULL if pte is
+ * from a swapped page.
+ */
+static void add_page_idle_list(struct page *page,
+			       unsigned long addr, struct mm_walk *walk)
+{
+	struct page *page_get = NULL;
+	struct page_node *pn;
+	int bit;
+	unsigned long frames;
+	struct page_idle_proc_priv *priv = walk->private;
+	u64 *chunk = (u64 *)priv->buffer;
+
+	if (priv->write) {
+		/* Find whether this page was asked to be marked */
+		frames = (addr - priv->start_addr) >> PAGE_SHIFT;
+		bit = frames % BITMAP_CHUNK_BITS;
+		chunk = &chunk[frames / BITMAP_CHUNK_BITS];
+		if (((*chunk >> bit) & 1) == 0)
+			return;
+	}
+
+	if (page) {
+		page_get = page_idle_get_page(page);
+		if (!page_get)
+			return;
+	}
+
+	pn = &(priv->page_nodes[priv->cur_page_node++]);
+	pn->page = page_get;
+	pn->addr = addr;
+	list_add(&pn->list, priv->idle_page_list);
+}
+
+static int pte_page_idle_proc_range(pmd_t *pmd, unsigned long addr,
+				    unsigned long end,
+				    struct mm_walk *walk)
+{
+	struct vm_area_struct *vma = walk->vma;
+	pte_t *pte;
+	spinlock_t *ptl;
+	struct page *page;
+
+	ptl = pmd_trans_huge_lock(pmd, vma);
+	if (ptl) {
+		if (pmd_present(*pmd)) {
+			page = follow_trans_huge_pmd(vma, addr, pmd,
+						     FOLL_DUMP|FOLL_WRITE);
+			if (!IS_ERR_OR_NULL(page))
+				add_page_idle_list(page, addr, walk);
+		}
+		spin_unlock(ptl);
+		return 0;
+	}
+
+	if (pmd_trans_unstable(pmd))
+		return 0;
+
+	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	for (; addr != end; pte++, addr += PAGE_SIZE) {
+		/*
+		 * We add swapped pages to the idle_page_list so that we can
+		 * report to userspace that they are idle.
+		 */
+		if (is_swap_pte(*pte)) {
+			add_page_idle_list(NULL, addr, walk);
+			continue;
+		}
+
+		if (!pte_present(*pte))
+			continue;
+
+		page = vm_normal_page(vma, addr, *pte);
+		if (page)
+			add_page_idle_list(page, addr, walk);
+	}
+
+	pte_unmap_unlock(pte - 1, ptl);
+	return 0;
+}
+
+ssize_t page_idle_proc_generic(struct file *file, char __user *ubuff,
+			       size_t count, loff_t *pos,
+			       struct task_struct *tsk, int write)
+{
+	int ret;
+	char *buffer;
+	u64 *out;
+	unsigned long start_addr, end_addr, start_frame, end_frame;
+	struct mm_struct *mm = file->private_data;
+	struct mm_walk walk = { .pmd_entry = pte_page_idle_proc_range, };
+	struct page_node *cur;
+	struct page_idle_proc_priv priv;
+	bool walk_error = false;
+	LIST_HEAD(idle_page_list);
+
+	if (!mm || !mmget_not_zero(mm))
+		return -EINVAL;
+
+	if (count > PAGE_SIZE)
+		count = PAGE_SIZE;
+
+	buffer = kzalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!buffer) {
+		ret = -ENOMEM;
+		goto out_mmput;
+	}
+	out = (u64 *)buffer;
+
+	if (write && copy_from_user(buffer, ubuff, count)) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	ret = page_idle_get_frames(*pos, count, mm, &start_frame, &end_frame);
+	if (ret)
+		goto out;
+
+	start_addr = (start_frame << PAGE_SHIFT);
+	end_addr = (end_frame << PAGE_SHIFT);
+	priv.buffer = buffer;
+	priv.start_addr = start_addr;
+	priv.write = write;
+
+	priv.idle_page_list = &idle_page_list;
+	priv.cur_page_node = 0;
+	priv.page_nodes = kzalloc(sizeof(struct page_node) *
+				  (end_frame - start_frame), GFP_KERNEL);
+	if (!priv.page_nodes) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	walk.private = &priv;
+	walk.mm = mm;
+
+	down_read(&mm->mmap_sem);
+
+	/*
+	 * idle_page_list is needed because walk_page_vma() holds ptlock which
+	 * deadlocks with page_idle_clear_pte_refs(). So we have to collect all
+	 * pages first, and then call page_idle_clear_pte_refs().
+	 */
+	ret = walk_page_range(start_addr, end_addr, &walk);
+	if (ret)
+		walk_error = true;
+
+	list_for_each_entry(cur, &idle_page_list, list) {
+		int bit, index;
+		unsigned long off;
+		struct page *page = cur->page;
+
+		if (unlikely(walk_error))
+			goto remove_page;
+
+		if (write) {
+			if (page) {
+				page_idle_clear_pte_refs(page);
+				set_page_idle(page);
+			}
+		} else {
+			if (!page || page_really_idle(page)) {
+				off = ((cur->addr) >> PAGE_SHIFT) - start_frame;
+				bit = off % BITMAP_CHUNK_BITS;
+				index = off / BITMAP_CHUNK_BITS;
+				out[index] |= 1ULL << bit;
+			}
+		}
+remove_page:
+		if (page)
+			put_page(page);
+	}
+
+	if (!write && !walk_error)
+		ret = copy_to_user(ubuff, buffer, count);
+
+	up_read(&mm->mmap_sem);
+	kfree(priv.page_nodes);
+out:
+	kfree(buffer);
+out_mmput:
+	mmput(mm);
+	if (!ret)
+		ret = count;
+	return ret;
+
+}
+
+ssize_t page_idle_proc_read(struct file *file, char __user *ubuff,
+			    size_t count, loff_t *pos, struct task_struct *tsk)
+{
+	return page_idle_proc_generic(file, ubuff, count, pos, tsk, 0);
+}
+
+ssize_t page_idle_proc_write(struct file *file, char __user *ubuff,
+			     size_t count, loff_t *pos, struct task_struct *tsk)
+{
+	return page_idle_proc_generic(file, ubuff, count, pos, tsk, 1);
+}
+
 static int __init page_idle_init(void)
 {
 	int err;

From patchwork Fri Jul 26 15:23:19 2019
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Joel Fernandes
X-Patchwork-Id: 11061247
From: "Joel Fernandes (Google)"
To: linux-kernel@vger.kernel.org
Cc: "Joel Fernandes (Google)" , Alexey Dobriyan , Andrew Morton , Brendan Gregg , Christian Hansen , dancol@google.com, fmayer@google.com, joaodias@google.com, joelaf@google.com, Jonathan Corbet , Kees Cook , kernel-team@android.com, linux-api@vger.kernel.org, linux-doc@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Michal Hocko , Mike Rapoport , minchan@kernel.org, namhyung@google.com, Roman Gushchin , Stephen Rothwell , surenb@google.com, tkjos@google.com, Vladimir Davydov
, Vlastimil Babka , wvw@google.com
Subject: [PATCH v3 2/2] doc: Update documentation for page_idle virtual address indexing
Date: Fri, 26 Jul 2019 11:23:19 -0400
Message-Id: <20190726152319.134152-2-joel@joelfernandes.org>
X-Mailer: git-send-email 2.22.0.709.g102302147b-goog
In-Reply-To: <20190726152319.134152-1-joel@joelfernandes.org>
References: <20190726152319.134152-1-joel@joelfernandes.org>

This patch updates the documentation to cover the new page_idle tracking
feature, which uses virtual address indexing.

Signed-off-by: Joel Fernandes (Google)
Reviewed-by: Sandeep Patil
Reviewed-by: Mike Rapoport
---
 .../admin-guide/mm/idle_page_tracking.rst | 43 ++++++++++++++++---
 1 file changed, 36 insertions(+), 7 deletions(-)

diff --git a/Documentation/admin-guide/mm/idle_page_tracking.rst b/Documentation/admin-guide/mm/idle_page_tracking.rst
index df9394fb39c2..1eeac78c94a7 100644
--- a/Documentation/admin-guide/mm/idle_page_tracking.rst
+++ b/Documentation/admin-guide/mm/idle_page_tracking.rst
@@ -19,10 +19,14 @@ It is enabled by CONFIG_IDLE_PAGE_TRACKING=y.
 
 User API
 ========
+There are two ways to access the idle page tracking API. One uses physical
+address indexing, the other uses a simpler virtual address indexing scheme.
 
-The idle page tracking API is located at ``/sys/kernel/mm/page_idle``.
-Currently, it consists of the only read-write file,
-``/sys/kernel/mm/page_idle/bitmap``.
+Physical address indexing
+-------------------------
+The idle page tracking API for physical address indexing using page frame
+numbers (PFN) is located at ``/sys/kernel/mm/page_idle``. Currently, it
+consists of a single read-write file, ``/sys/kernel/mm/page_idle/bitmap``.
 
 The file implements a bitmap where each bit corresponds to a memory page.
 The bitmap is represented by an array of 8-byte integers, and the page at PFN #i is
@@ -74,6 +78,31 @@ See :ref:`Documentation/admin-guide/mm/pagemap.rst ` for more
 information about ``/proc/pid/pagemap``, ``/proc/kpageflags``, and
 ``/proc/kpagecgroup``.
 
+Virtual address indexing
+------------------------
+The idle page tracking API for virtual address indexing using virtual page
+frame numbers (VFN) is located at ``/proc/<pid>/page_idle``. It is a bitmap
+that follows the same semantics as ``/sys/kernel/mm/page_idle/bitmap``
+except that it uses virtual instead of physical frame numbers.
+
+This idle page tracking API does not need to deal with PFNs, so it does not
+require prior lookups of ``pagemap`` to find out whether a page is idle. This
+is an advantage on some systems where looking up PFNs is considered a security
+issue. Also, in some cases this interface could be slightly more reliable to
+use than physical address indexing, since in physical address indexing, address
+space changes can occur between reading the ``pagemap`` and reading the
+``bitmap``, while in virtual address indexing, the process's ``mmap_sem`` is
+held for the duration of the access.
+
+To estimate the number of pages that are not used by a workload one should:
+
+ 1. Mark all the workload's pages as idle by setting corresponding bits in
+    ``/proc/<pid>/page_idle``.
+
+ 2. Wait until the workload accesses its working set.
+
+ 3. Read ``/proc/<pid>/page_idle`` and count the number of bits set.
+
 .. _impl_details:
 
 Implementation Details
@@ -99,10 +128,10 @@ When a dirty page is written to swap or disk as a result of memory reclaim or
 exceeding the dirty memory limit, it is not marked referenced.
 
 The idle memory tracking feature adds a new page flag, the Idle flag. This flag
-is set manually, by writing to ``/sys/kernel/mm/page_idle/bitmap`` (see the
-:ref:`User API `
-section), and cleared automatically whenever a page is referenced as defined
-above.
+is set manually, by writing to ``/sys/kernel/mm/page_idle/bitmap`` for physical
+addressing or by writing to ``/proc/<pid>/page_idle`` for virtual
+addressing (see the :ref:`User API ` section), and cleared
+automatically whenever a page is referenced as defined above.
 
 When a page is marked idle, the Accessed bit must be cleared in all PTEs it is
 mapped to, otherwise we will not be able to detect accesses to the page coming