From patchwork Sat Sep 1 11:28:20 2018
X-Patchwork-Submitter: Fengguang Wu
X-Patchwork-Id: 10584931
Message-Id: <20180901124811.530300789@intel.com>
Date: Sat, 01 Sep 2018 19:28:20 +0800
From: Fengguang Wu <fengguang.wu@intel.com>
To: Andrew Morton
Cc: Linux Memory Management List, Huang Ying, Brendan Gregg, Fengguang Wu,
    Peng DongX, Liu Jingqi, Dong Eddie, Dave Hansen, kvm@vger.kernel.org, LKML
Subject: [RFC][PATCH 2/5] proc: introduce /proc/PID/idle_bitmap
References: <20180901112818.126790961@intel.com>

This will be similar to /sys/kernel/mm/page_idle/bitmap documented in
Documentation/admin-guide/mm/idle_page_tracking.rst, but indexed by process
virtual address instead of PFN.
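To illustrate the intended usage, here is a minimal userspace sketch of how a
profiling tool could sample a task's working set through the proposed file.
It is only a sketch: the exact bitmap encoding (one bit per virtual page,
returned in VA order) is assumed here by analogy with the global page_idle
bitmap and is not defined by this patch.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	char path[64];
	uint64_t chunk[512];
	ssize_t n;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 1;
	}

	snprintf(path, sizeof(path), "/proc/%s/idle_bitmap", argv[1]);
	fd = open(path, O_RDONLY);
	if (fd < 0) {
		perror(path);
		return 1;
	}

	/*
	 * One sequential read per task replaces the pagemap lookup plus
	 * scattered reads of the global PFN-indexed bitmap.
	 */
	while ((n = read(fd, chunk, sizeof(chunk))) > 0) {
		/* count idle bits here to estimate the working set (omitted) */
	}

	close(fd);
	return 0;
}

The point of the interface, as argued below, is that this single sequential
read avoids the pagemap lookups and random PFN accesses entirely.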
When using the global PFN-indexed idle bitmap, we found two kinds of overhead:

- To track a task's working set, Brendan Gregg ended up writing wss-v1 for
  small tasks and wss-v2 for large tasks:

	https://github.com/brendangregg/wss

  That is because VAs may point to random PAs scattered across the physical
  address space. So we either query /proc/pid/pagemap first and then access
  lots of random PFNs in the bitmap (with lots of syscalls), or write+read
  the whole system idle bitmap beforehand.

- Page table walking by PFN has much more overhead than walking a page table
  in its natural order:
  - rmap queries
  - more locking
  - random memory reads/writes

This interface provides a cheap path for the majority of non-shared mapped
pages. To walk 1TB of memory in active 4k pages, it costs 2s vs 15s of system
time to scan the per-task vs the global idle bitmap, i.e. roughly a 7x
speedup. The gap widens further if we consider
- the extra /proc/pid/pagemap walk, and
- that natural page table walks can skip all 512 PTEs when the PMD is idle.

OTOH, the per-task idle bitmap is not suitable in some situations:
- it is not accurate for shared pages
- it does not work with non-mapped file pages
- it does not perform well for sparse page tables (pointed out by Huang Ying)

So it is more about complementing the existing global idle bitmap.

CC: Huang Ying
CC: Brendan Gregg
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
 fs/proc/base.c     |  2 ++
 fs/proc/internal.h |  1 +
 fs/proc/task_mmu.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 66 insertions(+)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index aaffc0c30216..d81322b5b8d2 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2942,6 +2942,7 @@ static const struct pid_entry tgid_base_stuff[] = {
 	REG("smaps", S_IRUGO, proc_pid_smaps_operations),
 	REG("smaps_rollup", S_IRUGO, proc_pid_smaps_rollup_operations),
 	REG("pagemap", S_IRUSR, proc_pagemap_operations),
+	REG("idle_bitmap", S_IRUSR|S_IWUSR, proc_mm_idle_operations),
 #endif
 #ifdef CONFIG_SECURITY
 	DIR("attr", S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations),
@@ -3327,6 +3328,7 @@ static const struct pid_entry tid_base_stuff[] = {
 	REG("smaps", S_IRUGO, proc_tid_smaps_operations),
 	REG("smaps_rollup", S_IRUGO, proc_pid_smaps_rollup_operations),
 	REG("pagemap", S_IRUSR, proc_pagemap_operations),
+	REG("idle_bitmap", S_IRUSR|S_IWUSR, proc_mm_idle_operations),
 #endif
 #ifdef CONFIG_SECURITY
 	DIR("attr", S_IRUGO|S_IXUGO, proc_attr_dir_inode_operations, proc_attr_dir_operations),
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index da3dbfa09e79..732a502acc27 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -305,6 +305,7 @@ extern const struct file_operations proc_pid_smaps_rollup_operations;
 extern const struct file_operations proc_tid_smaps_operations;
 extern const struct file_operations proc_clear_refs_operations;
 extern const struct file_operations proc_pagemap_operations;
+extern const struct file_operations proc_mm_idle_operations;
 
 extern unsigned long task_vsize(struct mm_struct *);
 extern unsigned long task_statm(struct mm_struct *,
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index dfd73a4616ce..376406a9cf45 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1564,6 +1564,69 @@ const struct file_operations proc_pagemap_operations = {
 	.open		= pagemap_open,
 	.release	= pagemap_release,
 };
+
+/* will be filled when kvm_ept_idle module loads */
+struct file_operations proc_ept_idle_operations = {
+};
+EXPORT_SYMBOL_GPL(proc_ept_idle_operations);
+
+static ssize_t mm_idle_read(struct file *file, char __user *buf,
+			    size_t count, loff_t *ppos)
+{
+	struct task_struct *task = file->private_data;
+	ssize_t ret = -ESRCH;
+
+	// TODO: implement mm_walk for normal tasks
+
+	if (task_kvm(task)) {
+		if (proc_ept_idle_operations.read)
+			return proc_ept_idle_operations.read(file, buf, count, ppos);
+	}
+
+	return ret;
+}
+
+
+static int mm_idle_open(struct inode *inode, struct file *file)
+{
+	struct task_struct *task = get_proc_task(inode);
+
+	if (!task)
+		return -ESRCH;
+
+	file->private_data = task;
+
+	if (task_kvm(task)) {
+		if (proc_ept_idle_operations.open)
+			return proc_ept_idle_operations.open(inode, file);
+	}
+
+	return 0;
+}
+
+static int mm_idle_release(struct inode *inode, struct file *file)
+{
+	struct task_struct *task = file->private_data;
+
+	if (!task)
+		return 0;
+
+	if (task_kvm(task)) {
+		if (proc_ept_idle_operations.release)
+			return proc_ept_idle_operations.release(inode, file);
+	}
+
+	put_task_struct(task);
+	return 0;
+}
+
+const struct file_operations proc_mm_idle_operations = {
+	.llseek		= mem_lseek, /* borrow this */
+	.read		= mm_idle_read,
+	.open		= mm_idle_open,
+	.release	= mm_idle_release,
+};
+
 #endif /* CONFIG_PROC_PAGE_MONITOR */
 
 #ifdef CONFIG_NUMA
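For readers following the series: proc_ept_idle_operations above is left
empty and exported on purpose, so that a later module can install its
handlers at load time. Below is a minimal sketch of what that registration
could look like. The ept_idle_read name, the init/exit functions and the
module boilerplate are illustrative placeholders, not code from this series.

#include <linux/module.h>
#include <linux/fs.h>

/* The declaration would come from a shared header in the full series. */
extern struct file_operations proc_ept_idle_operations;

static ssize_t ept_idle_read(struct file *file, char __user *buf,
			     size_t count, loff_t *ppos)
{
	/* Walk the guest EPT and copy idle bits to userspace (omitted). */
	return 0;
}

static int __init ept_idle_init(void)
{
	/* mm_idle_read() will dispatch here for tasks that own a KVM. */
	proc_ept_idle_operations.read = ept_idle_read;
	return 0;
}

static void __exit ept_idle_exit(void)
{
	proc_ept_idle_operations.read = NULL;
}

module_init(ept_idle_init);
module_exit(ept_idle_exit);
MODULE_LICENSE("GPL");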