From patchwork Mon Dec 7 11:30:54 2020
X-Patchwork-Submitter: yulei zhang
X-Patchwork-Id: 11955329
From: yulei.kernel@gmail.com
X-Google-Original-From: yuleixzhang@tencent.com
To: linux-mm@kvack.org,
    akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org,
    kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
    naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com
Cc: joao.m.martins@oracle.com, rdunlap@infradead.org,
    sean.j.christopherson@intel.com, xiaoguangrong.eric@gmail.com,
    kernellwp@gmail.com, lihaiwei.kernel@gmail.com,
    Yulei Zhang, Xiao Guangrong
Subject: [RFC V2 01/37] fs: introduce dmemfs module
Date: Mon, 7 Dec 2020 19:30:54 +0800
X-Mailing-List: linux-fsdevel@vger.kernel.org

From: Yulei Zhang

dmemfs (Direct Memory filesystem) is a filesystem backed by device
memory or reserved memory. This kind of memory is special as it is
not managed by the kernel and is not associated with 'struct page'.

The original purpose of dmemfs is to drop the usage of 'struct page'
and thereby save the extra system memory in public cloud environments.

This patch introduces the basic framework of dmemfs; only mkdir and
creating regular files are supported so far.

Signed-off-by: Xiao Guangrong
Signed-off-by: Yulei Zhang
---
 fs/Kconfig                 |   1 +
 fs/Makefile                |   1 +
 fs/dmemfs/Kconfig          |  13 +++
 fs/dmemfs/Makefile         |   7 ++
 fs/dmemfs/inode.c          | 266 +++++++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/magic.h |   1 +
 6 files changed, 289 insertions(+)
 create mode 100644 fs/dmemfs/Kconfig
 create mode 100644 fs/dmemfs/Makefile
 create mode 100644 fs/dmemfs/inode.c

diff --git a/fs/Kconfig b/fs/Kconfig
index aa4c122..18e7208 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -41,6 +41,7 @@ source "fs/btrfs/Kconfig"
 source "fs/nilfs2/Kconfig"
 source "fs/f2fs/Kconfig"
 source "fs/zonefs/Kconfig"
+source "fs/dmemfs/Kconfig"
 
 config FS_DAX
 	bool "Direct Access (DAX) support"
diff --git a/fs/Makefile b/fs/Makefile
index 999d1a2..34747ec 100644
--- a/fs/Makefile
+++ b/fs/Makefile
@@ -136,3 +136,4 @@ obj-$(CONFIG_EFIVAR_FS)	+= efivarfs/
 obj-$(CONFIG_EROFS_FS)		+= erofs/
 obj-$(CONFIG_VBOXSF_FS)	+= vboxsf/
 obj-$(CONFIG_ZONEFS_FS)	+= zonefs/
+obj-$(CONFIG_DMEM_FS)		+= dmemfs/
diff --git a/fs/dmemfs/Kconfig b/fs/dmemfs/Kconfig
new file mode 100644
index 00000000..d2894a5
--- /dev/null
+++ b/fs/dmemfs/Kconfig
@@ -0,0 +1,13 @@
+config DMEM_FS
+	tristate "Direct Memory filesystem support"
+	help
+	  dmemfs (Direct Memory filesystem) is a filesystem backed by device
+	  memory or reserved memory. This kind of memory is special as it
+	  is not managed by the kernel and has no 'struct page'.
+
+	  The original purpose of dmemfs is to save the extra memory used
+	  by 'struct page', which reduces the total cost of ownership (TCO)
+	  for cloud providers.
+
+	  To compile this file system support as a module, choose M here: the
+	  module will be called dmemfs.
diff --git a/fs/dmemfs/Makefile b/fs/dmemfs/Makefile
new file mode 100644
index 00000000..73bdc9c
--- /dev/null
+++ b/fs/dmemfs/Makefile
@@ -0,0 +1,7 @@
+# SPDX-License-Identifier: GPL-2.0
+#
+# Makefile for the linux dmem-filesystem routines.
+# +obj-$(CONFIG_DMEM_FS) += dmemfs.o + +dmemfs-y += inode.o diff --git a/fs/dmemfs/inode.c b/fs/dmemfs/inode.c new file mode 100644 index 00000000..0aa3d3b --- /dev/null +++ b/fs/dmemfs/inode.c @@ -0,0 +1,266 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * linux/fs/dmemfs/inode.c + * + * Authors: + * Xiao Guangrong + * Chen Zhuo + * Haiwei Li + * Yulei Zhang + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +MODULE_AUTHOR("Tencent Corporation"); +MODULE_LICENSE("GPL v2"); + +struct dmemfs_mount_opts { + unsigned long dpage_size; +}; + +struct dmemfs_fs_info { + struct dmemfs_mount_opts mount_opts; +}; + +enum dmemfs_param { + Opt_dpagesize, +}; + +const struct fs_parameter_spec dmemfs_fs_parameters[] = { + fsparam_string("pagesize", Opt_dpagesize), + {} +}; + +static int check_dpage_size(unsigned long dpage_size) +{ + if (dpage_size != PAGE_SIZE && dpage_size != PMD_SIZE && + dpage_size != PUD_SIZE) + return -EINVAL; + + return 0; +} + +static struct inode * +dmemfs_get_inode(struct super_block *sb, const struct inode *dir, umode_t mode); + +static int +__create_file(struct inode *dir, struct dentry *dentry, umode_t mode) +{ + struct inode *inode = dmemfs_get_inode(dir->i_sb, dir, mode); + int error = -ENOSPC; + + if (inode) { + d_instantiate(dentry, inode); + dget(dentry); /* Extra count - pin the dentry in core */ + error = 0; + dir->i_mtime = dir->i_ctime = current_time(inode); + if (mode & S_IFDIR) + inc_nlink(dir); + } + return error; +} + +static int dmemfs_create(struct inode *dir, struct dentry *dentry, + umode_t mode, bool excl) +{ + return __create_file(dir, dentry, mode | S_IFREG); +} + +static int dmemfs_mkdir(struct inode *dir, struct dentry *dentry, + umode_t mode) +{ + return __create_file(dir, dentry, mode | S_IFDIR); +} + +static const struct inode_operations dmemfs_dir_inode_operations = { + .create = dmemfs_create, + .lookup = simple_lookup, + .unlink = simple_unlink, + .mkdir = dmemfs_mkdir, + .rmdir = simple_rmdir, + .rename = simple_rename, +}; + +static const struct inode_operations dmemfs_file_inode_operations = { + .setattr = simple_setattr, + .getattr = simple_getattr, +}; + +static const struct file_operations dmemfs_file_operations = { +}; + +static int dmemfs_parse_param(struct fs_context *fc, struct fs_parameter *param) +{ + struct dmemfs_fs_info *fsi = fc->s_fs_info; + struct fs_parse_result result; + int opt, ret; + + opt = fs_parse(fc, dmemfs_fs_parameters, param, &result); + if (opt < 0) + return opt; + + switch (opt) { + case Opt_dpagesize: + fsi->mount_opts.dpage_size = memparse(param->string, NULL); + ret = check_dpage_size(fsi->mount_opts.dpage_size); + if (ret) { + pr_warn("dmemfs: unknown pagesize %x.\n", + result.uint_32); + return ret; + } + break; + default: + pr_warn("dmemfs: unknown mount option [%x].\n", + opt); + return -EINVAL; + } + + return 0; +} + +struct inode *dmemfs_get_inode(struct super_block *sb, + const struct inode *dir, umode_t mode) +{ + struct inode *inode = new_inode(sb); + + if (inode) { + inode->i_ino = get_next_ino(); + inode_init_owner(inode, dir, mode); + inode->i_mapping->a_ops = &empty_aops; + mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER); + mapping_set_unevictable(inode->i_mapping); + inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode); + switch (mode & S_IFMT) { + default: + init_special_inode(inode, mode, 0); + break; + case S_IFREG: + inode->i_op = 
&dmemfs_file_inode_operations;
+			inode->i_fop = &dmemfs_file_operations;
+			break;
+		case S_IFDIR:
+			inode->i_op = &dmemfs_dir_inode_operations;
+			inode->i_fop = &simple_dir_operations;
+
+			/*
+			 * directory inodes start off with i_nlink == 2
+			 * (for "." entry)
+			 */
+			inc_nlink(inode);
+			break;
+		case S_IFLNK:
+			inode->i_op = &page_symlink_inode_operations;
+			break;
+		}
+	}
+	return inode;
+}
+
+static int dmemfs_statfs(struct dentry *dentry, struct kstatfs *buf)
+{
+	simple_statfs(dentry, buf);
+	buf->f_bsize = dentry->d_sb->s_blocksize;
+
+	return 0;
+}
+
+static const struct super_operations dmemfs_ops = {
+	.statfs = dmemfs_statfs,
+	.drop_inode = generic_delete_inode,
+};
+
+static int
+dmemfs_fill_super(struct super_block *sb, struct fs_context *fc)
+{
+	struct inode *inode;
+	struct dmemfs_fs_info *fsi = sb->s_fs_info;
+
+	sb->s_maxbytes = MAX_LFS_FILESIZE;
+	sb->s_blocksize = fsi->mount_opts.dpage_size;
+	sb->s_blocksize_bits = ilog2(fsi->mount_opts.dpage_size);
+	sb->s_magic = DMEMFS_MAGIC;
+	sb->s_op = &dmemfs_ops;
+	sb->s_time_gran = 1;
+
+	inode = dmemfs_get_inode(sb, NULL, S_IFDIR);
+	sb->s_root = d_make_root(inode);
+	if (!sb->s_root)
+		return -ENOMEM;
+
+	return 0;
+}
+
+static int dmemfs_get_tree(struct fs_context *fc)
+{
+	return get_tree_nodev(fc, dmemfs_fill_super);
+}
+
+static void dmemfs_free_fc(struct fs_context *fc)
+{
+	kfree(fc->s_fs_info);
+}
+
+static const struct fs_context_operations dmemfs_context_ops = {
+	.free		= dmemfs_free_fc,
+	.parse_param	= dmemfs_parse_param,
+	.get_tree	= dmemfs_get_tree,
+};
+
+int dmemfs_init_fs_context(struct fs_context *fc)
+{
+	struct dmemfs_fs_info *fsi;
+
+	fsi = kzalloc(sizeof(*fsi), GFP_KERNEL);
+	if (!fsi)
+		return -ENOMEM;
+
+	fsi->mount_opts.dpage_size = PAGE_SIZE;
+	fc->s_fs_info = fsi;
+	fc->ops = &dmemfs_context_ops;
+	return 0;
+}
+
+static void dmemfs_kill_sb(struct super_block *sb)
+{
+	kill_litter_super(sb);
+}
+
+static struct file_system_type dmemfs_fs_type = {
+	.owner		= THIS_MODULE,
+	.name		= "dmemfs",
+	.init_fs_context = dmemfs_init_fs_context,
+	.kill_sb	= dmemfs_kill_sb,
+};
+
+static int __init dmemfs_init(void)
+{
+	int ret;
+
+	ret = register_filesystem(&dmemfs_fs_type);
+
+	return ret;
+}
+
+static void __exit dmemfs_uninit(void)
+{
+	unregister_filesystem(&dmemfs_fs_type);
+}
+
+module_init(dmemfs_init)
+module_exit(dmemfs_uninit)
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index f3956fc..3fbd066 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -97,5 +97,6 @@
 #define DEVMEM_MAGIC	0x454d444d	/* "DMEM" */
 #define Z3FOLD_MAGIC	0x33
 #define PPC_CMM_MAGIC	0xc7571590
+#define DMEMFS_MAGIC	0x2ace90c6
 
 #endif /* __LINUX_MAGIC_H__ */
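For illustration only, a minimal sketch of how userspace could mount dmemfs
once this module is available; the mount point and the 2M page size below are
hypothetical examples, and "pagesize" is the option handled by
dmemfs_parse_param() above (it must be the native page size, PMD_SIZE or
PUD_SIZE):

	/* sketch: mount dmemfs with 2M dmem pages at a hypothetical mount point */
	#include <sys/mount.h>
	#include <stdio.h>

	int main(void)
	{
		if (mount("none", "/mnt/dmemfs", "dmemfs", 0, "pagesize=2M") < 0) {
			perror("mount dmemfs");
			return 1;
		}
		/* files and directories can then be created under /mnt/dmemfs */
		return 0;
	}

The same thing from a shell would be "mount -t dmemfs none -o pagesize=2M /mnt/dmemfs".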
From patchwork Mon Dec 7 11:30:55 2020
X-Patchwork-Submitter: yulei zhang
X-Patchwork-Id: 11955331
From: yulei.kernel@gmail.com
X-Google-Original-From: yuleixzhang@tencent.com
To: linux-mm@kvack.org, akpm@linux-foundation.org,
    linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org,
    linux-kernel@vger.kernel.org, naoya.horiguchi@nec.com,
    viro@zeniv.linux.org.uk, pbonzini@redhat.com
Cc: joao.m.martins@oracle.com, rdunlap@infradead.org,
    sean.j.christopherson@intel.com, xiaoguangrong.eric@gmail.com,
    kernellwp@gmail.com, lihaiwei.kernel@gmail.com,
    Yulei Zhang, Xiao Guangrong
Subject: [RFC V2 02/37] mm: support direct memory reservation
Date: Mon, 7 Dec 2020 19:30:55 +0800
X-Mailing-List: linux-fsdevel@vger.kernel.org

From: Yulei Zhang

Introduce 'dmem=' to reserve system memory for DMEM (direct memory).
Compared with 'mem=' and 'memmap=', it reserves memory based on the
NUMA topology. For detailed information, please refer to
kernel-parameters.txt.

Signed-off-by: Xiao Guangrong
Signed-off-by: Yulei Zhang --- Documentation/admin-guide/kernel-parameters.txt | 38 +++ arch/x86/kernel/setup.c | 3 + include/linux/dmem.h | 16 ++ mm/Kconfig | 8 + mm/Makefile | 1 + mm/dmem.c | 137 +++++++++++ mm/dmem_reserve.c | 303 ++++++++++++++++++++++++ 7 files changed, 506 insertions(+) create mode 100644 include/linux/dmem.h create mode 100644 mm/dmem.c create mode 100644 mm/dmem_reserve.c diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 526d65d..78caf11 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -991,6 +991,44 @@ The filter can be disabled or changed to another driver later using sysfs. + dmem=[!]size[KMG] + [KNL, NUMA] When CONFIG_DMEM is set, this means + the size of memory reserved for dmemfs on each NUMA + memory node and 'size' must be aligned to the default + alignment that is the size of memory section which is + 128M by default on x86_64. If set '!', such amount of + memory on each node will be owned by kernel and dmemfs + owns the rest of memory on each node. + Example: Reserve 4G memory on each node for dmemfs + dmem = 4G + + dmem=[!]size[KMG]:align[KMG] + [KNL, NUMA] Ditto. 'align' should be power of two and + not smaller than the default alignment. Also 'size' + must be aligned to 'align'. + Example: Bad dmem parameter because 'size' misaligned + dmem=0x40200000:1G + + dmem=size[KMG]@addr[KMG] + [KNL] When CONFIG_DMEM is set, this marks specific + memory as reserved for dmemfs. Region of memory will be + used by dmemfs, from addr to addr + size. Reserving a + certain memory region for kernel is illegal so '!' is + forbidden. Should not assign 'addr' to 0 because kernel + will occupy fixed memory region beginning at 0 address. + Ditto, 'size' and 'addr' must be aligned to default + alignment. + Example: Exclude memory from 5G-6G for dmemfs. + dmem=1G@5G + + dmem=size[KMG]@addr[KMG]:align[KMG] + [KNL] Ditto. 'align' should be power of two and + not smaller than the default alignment. Also 'size' + and 'addr' must be aligned to 'align'. Specially, + '@addr' and ':align' could occur in any order. + Example: Exclude memory from 5G-6G for dmemfs. + dmem=1G:1G@5G + driver_async_probe= [KNL] List of driver names to be probed asynchronously. Format: ,... diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c index 84f581c..9d05e1b 100644 --- a/arch/x86/kernel/setup.c +++ b/arch/x86/kernel/setup.c @@ -48,6 +48,7 @@ #include #include #include +#include /* * max_low_pfn_mapped: highest directly mapped pfn < 4 GB @@ -1149,6 +1150,8 @@ void __init setup_arch(char **cmdline_p) if (boot_cpu_has(X86_FEATURE_GBPAGES)) hugetlb_cma_reserve(PUD_SHIFT - PAGE_SHIFT); + dmem_reserve_init(); + /* * Reserve memory for crash kernel after SRAT is parsed so that it * won't consume hotpluggable memory. 
diff --git a/include/linux/dmem.h b/include/linux/dmem.h new file mode 100644 index 00000000..5049322 --- /dev/null +++ b/include/linux/dmem.h @@ -0,0 +1,16 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +#ifndef _LINUX_DMEM_H +#define _LINUX_DMEM_H + +#ifdef CONFIG_DMEM +int dmem_reserve_init(void); +void dmem_init(void); +int dmem_region_register(int node, phys_addr_t start, phys_addr_t end); + +#else +static inline int dmem_reserve_init(void) +{ + return 0; +} +#endif +#endif /* _LINUX_DMEM_H */ diff --git a/mm/Kconfig b/mm/Kconfig index d42423f..3a6d408 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -226,6 +226,14 @@ config BALLOON_COMPACTION scenario aforementioned and helps improving memory defragmentation. # +# support for direct memory basics +config DMEM + bool "Direct Memory Reservation" + depends on SPARSEMEM + help + Allow reservation of memory which could be for the dedicated use of dmem. + It's the basis of dmemfs. + # support for memory compaction config COMPACTION bool "Allow for memory compaction" diff --git a/mm/Makefile b/mm/Makefile index d73aed0..775c8518 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -120,3 +120,4 @@ obj-$(CONFIG_MEMFD_CREATE) += memfd.o obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o obj-$(CONFIG_PTDUMP_CORE) += ptdump.o obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o +obj-$(CONFIG_DMEM) += dmem.o dmem_reserve.o diff --git a/mm/dmem.c b/mm/dmem.c new file mode 100644 index 00000000..b5fb4f1 --- /dev/null +++ b/mm/dmem.c @@ -0,0 +1,137 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * memory management for dmemfs + * + * Authors: + * Xiao Guangrong + * Chen Zhuo + * Haiwei Li + * Yulei Zhang + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include + +/* + * There are two kinds of page in dmem management: + * - nature page, it's the CPU's page size, i.e, 4K on x86 + * + * - dmem page, it's the unit size used by dmem itself to manage all + * registered memory. 
It's set by dmem_alloc_init() + */ +struct dmem_region { + /* original registered memory region */ + phys_addr_t reserved_start_addr; + phys_addr_t reserved_end_addr; + + /* memory region aligned to dmem page */ + phys_addr_t dpage_start_pfn; + phys_addr_t dpage_end_pfn; + + /* + * avoid memory allocation if the dmem region is small enough + */ + unsigned long static_bitmap; + unsigned long *bitmap; + u64 next_free_pos; + struct list_head node; + + unsigned long static_error_bitmap; + unsigned long *error_bitmap; +}; + +/* + * statically define number of regions to avoid allocating memory + * dynamically from memblock as slab is not available at that time + */ +#define DMEM_REGION_PAGES 2 +#define INIT_REGION_NUM \ + ((DMEM_REGION_PAGES << PAGE_SHIFT) / sizeof(struct dmem_region)) + +static struct dmem_region static_regions[INIT_REGION_NUM]; + +struct dmem_node { + unsigned long total_dpages; + unsigned long free_dpages; + + /* fallback list for allocation */ + int nodelist[MAX_NUMNODES]; + struct list_head regions; +}; + +struct dmem_pool { + struct mutex lock; + + unsigned long region_num; + unsigned long registered_pages; + unsigned long unaligned_pages; + + /* shift bits of dmem page */ + unsigned long dpage_shift; + + unsigned long total_dpages; + unsigned long free_dpages; + + /* + * increased when allocator is initialized, + * stop it being destroyed when someone is + * still using it + */ + u64 user_count; + struct dmem_node nodes[MAX_NUMNODES]; +}; + +static struct dmem_pool dmem_pool = { + .lock = __MUTEX_INITIALIZER(dmem_pool.lock), +}; + +#define for_each_dmem_node(_dnode) \ + for (_dnode = dmem_pool.nodes; \ + _dnode < dmem_pool.nodes + ARRAY_SIZE(dmem_pool.nodes); \ + _dnode++) + +void __init dmem_init(void) +{ + struct dmem_node *dnode; + + pr_info("dmem: pre-defined region: %ld\n", INIT_REGION_NUM); + + for_each_dmem_node(dnode) + INIT_LIST_HEAD(&dnode->regions); +} + +/* + * register the memory region to dmem pool as freed memory, the region + * should be properly aligned to PAGE_SIZE at least + * + * it's safe to be out of dmem_pool's lock as it's used at the very + * beginning of system boot + */ +int dmem_region_register(int node, phys_addr_t start, phys_addr_t end) +{ + struct dmem_region *dregion; + + pr_info("dmem: register region [%#llx - %#llx] on node %d.\n", + (unsigned long long)start, (unsigned long long)end, node); + + if (unlikely(dmem_pool.region_num >= INIT_REGION_NUM)) { + pr_err("dmem: region is not sufficient.\n"); + return -ENOMEM; + } + + dregion = &static_regions[dmem_pool.region_num++]; + dregion->reserved_start_addr = start; + dregion->reserved_end_addr = end; + + list_add_tail(&dregion->node, &dmem_pool.nodes[node].regions); + dmem_pool.registered_pages += __phys_to_pfn(end) - + __phys_to_pfn(start); + return 0; +} + diff --git a/mm/dmem_reserve.c b/mm/dmem_reserve.c new file mode 100644 index 00000000..567ee9f --- /dev/null +++ b/mm/dmem_reserve.c @@ -0,0 +1,303 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Support reserved memory for dmem. + * As dmem_reserve_init will adjust memblock to reserve memory + * for dmem, we could save a vast amount of memory for 'struct page'. + * + * Authors: + * Xiao Guangrong + */ +#include +#include +#include +#include +#include + +struct dmem_param { + phys_addr_t base; + phys_addr_t size; + phys_addr_t align; + /* + * If set to 1, dmem_param specified requested memory for kernel, + * otherwise for dmem. 
+ */ + bool resv_kernel; +}; + +static struct dmem_param dmem_param __initdata; + +/* Check dmem param defined by user to match dmem align */ +static int __init check_dmem_param(bool resv_kernel, phys_addr_t base, + phys_addr_t size, phys_addr_t align) +{ + phys_addr_t min_align = 1UL << SECTION_SIZE_BITS; + + if (!align) + align = min_align; + + /* + * the reserved region should be aligned to memory section + * at least + */ + if (align < min_align) { + pr_warn("dmem: 'align' should be %#llx at least to be aligned to memory section.\n", + min_align); + return -EINVAL; + } + + if (!is_power_of_2(align)) { + pr_warn("dmem: 'align' should be power of 2.\n"); + return -EINVAL; + } + + if (base & (align - 1)) { + pr_warn("dmem: 'addr' is unaligned to 'align' in dmem=\n"); + return -EINVAL; + } + + if (size & (align - 1)) { + pr_warn("dmem: 'size' is unaligned to 'align' in dmem=\n"); + return -EINVAL; + } + + if (base >= base + size) { + pr_warn("dmem: 'addr + size' overflow in dmem=\n"); + return -EINVAL; + } + + if (resv_kernel && base) { + pr_warn("dmem: take a certain base address for kernel is illegal\n"); + return -EINVAL; + } + + dmem_param.base = base; + dmem_param.size = size; + dmem_param.align = align; + dmem_param.resv_kernel = resv_kernel; + + pr_info("dmem: parameter: base address %#llx size %#llx align %#llx resv_kernel %d\n", + (unsigned long long)base, (unsigned long long)size, + (unsigned long long)align, resv_kernel); + return 0; +} + +static int __init parse_dmem(char *p) +{ + phys_addr_t base, size, align; + char *oldp; + bool resv_kernel = false; + + if (!p) + return -EINVAL; + + base = align = 0; + + if (*p == '!') { + resv_kernel = true; + p++; + } + + oldp = p; + size = memparse(p, &p); + if (oldp == p) + return -EINVAL; + + if (!size) { + pr_warn("dmem: 'size' of 0 defined in dmem=, or {invalid} param\n"); + return -EINVAL; + } + + while (*p) { + phys_addr_t *pvalue; + + switch (*p) { + case '@': + pvalue = &base; + break; + case ':': + pvalue = &align; + break; + default: + pr_warn("dmem: unknown indicator: %c in dmem=\n", *p); + return -EINVAL; + } + + /* + * Some attribute had been specified multiple times. + * This is not allowed. + */ + if (*pvalue) + return -EINVAL; + + oldp = ++p; + *pvalue = memparse(p, &p); + if (oldp == p) + return -EINVAL; + + if (*pvalue == 0) { + pr_warn("dmem: 'addr' or 'align' should not be set to 0\n"); + return -EINVAL; + } + } + + return check_dmem_param(resv_kernel, base, size, align); +} + +early_param("dmem", parse_dmem); + +/* + * We wanna remove a memory range from memblock.memory thoroughly. + * As isolating memblock.memory in memblock_remove needs to double + * the array of memblock_region, allocated memory for new array maybe + * locate in the memory range which we wanna to remove. + * So, conflict. + * To resolve this conflict, here reserve this memory range firstly. + * While reserving this memory range, isolating memory.reserved will allocate + * memory excluded from memory range which to be removed. So following + * double array in memblock_remove can't observe this reserved range. 
+ */ +static void __init dmem_remove_memblock(phys_addr_t base, phys_addr_t size) +{ + memblock_reserve(base, size); + memblock_remove(base, size); + memblock_free(base, size); +} + +static u64 node_req_mem[MAX_NUMNODES] __initdata; + +/* Reserve certain size of memory for dmem in each numa node */ +static void __init dmem_reserve_size(phys_addr_t size, phys_addr_t align, + bool resv_kernel) +{ + phys_addr_t start, end; + u64 i; + int nid; + + /* Calculate available free memory on each node */ + for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE, &start, + &end, &nid) + node_req_mem[nid] += end - start; + + /* Calculate memory size needed to reserve on each node for dmem */ + for (i = 0; i < MAX_NUMNODES; i++) { + node_req_mem[i] = ALIGN(node_req_mem[i], align); + + if (!resv_kernel) { + node_req_mem[i] = min(size, node_req_mem[i]); + continue; + } + + /* leave dmem_param.size memory for kernel */ + if (node_req_mem[i] > size) + node_req_mem[i] = node_req_mem[i] - size; + else + node_req_mem[i] = 0; + } + +retry: + for_each_free_mem_range_reverse(i, NUMA_NO_NODE, MEMBLOCK_NONE, + &start, &end, &nid) { + /* Well, we have got enough memory for this node. */ + if (!node_req_mem[nid]) + continue; + + start = round_up(start, align); + end = round_down(end, align); + /* Skip memblock_region which is too small */ + if (start >= end) + continue; + + /* Towards memory block at higher address */ + start = end - min((end - start), node_req_mem[nid]); + + /* + * do not have enough resource to save the region, skip it + * from now on + */ + if (dmem_region_register(nid, start, end) < 0) + break; + + dmem_remove_memblock(start, end - start); + + node_req_mem[nid] -= end - start; + + /* We have dropped a memblock, so re-walk it. */ + goto retry; + } + + for (i = 0; i < MAX_NUMNODES; i++) { + if (!node_req_mem[i]) + continue; + + pr_info("dmem: %#llx size of memory is not reserved on node %lld due to misaligned regions.\n", + (unsigned long long)size, i); + } + +} + +/* Reserve [base, base + size) for dmem. */ +static void __init +dmem_reserve_region(phys_addr_t base, phys_addr_t size, phys_addr_t align) +{ + phys_addr_t start, end; + phys_addr_t p_start, p_end; + u64 i; + int nid; + + p_start = base; + p_end = base + size; + +retry: + for_each_free_mem_range_reverse(i, NUMA_NO_NODE, MEMBLOCK_NONE, + &start, &end, &nid) { + /* Find region located in user defined range. */ + if (start >= p_end || end <= p_start) + continue; + + start = round_up(max(start, p_start), align); + end = round_down(min(end, p_end), align); + if (start >= end) + continue; + + if (dmem_region_register(nid, start, end) < 0) + break; + + dmem_remove_memblock(start, end - start); + + size -= end - start; + if (!size) + return; + + /* We have dropped a memblock, so re-walk it. */ + goto retry; + } + + pr_info("dmem: %#llx size of memory is not reserved for dmem due to holes and misaligned regions in [%#llx, %#llx].\n", + (unsigned long long)size, (unsigned long long)base, + (unsigned long long)(base + size)); +} + +/* Reserve memory for dmem */ +int __init dmem_reserve_init(void) +{ + phys_addr_t base, size, align; + bool resv_kernel; + + dmem_init(); + + base = dmem_param.base; + size = dmem_param.size; + align = dmem_param.align; + resv_kernel = dmem_param.resv_kernel; + + /* Dmem param had not been enabled. 
*/
+	if (size == 0)
+		return 0;
+
+	if (base)
+		dmem_reserve_region(base, size, align);
+	else
+		dmem_reserve_size(size, align, resv_kernel);
+
+	return 0;
+}
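As a hedged illustration (not part of this patch), an architecture other than
x86 could wire the reservation up the same way the setup_arch() change above
does; the function name below is hypothetical, and the only requirement shown
here is that dmem_reserve_init() runs while memblock is still the active boot
allocator, after the NUMA topology is known:

	#include <linux/dmem.h>

	/* hypothetical arch setup hook, mirroring the x86 setup_arch() call site */
	void __init hypothetical_arch_setup_dmem(void)
	{
		/*
		 * dmem_reserve_init() consumes the 'dmem=' parameter recorded
		 * by parse_dmem() and carves the requested ranges out of
		 * memblock via dmem_region_register()/dmem_remove_memblock().
		 */
		dmem_reserve_init();
	}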
From patchwork Mon Dec 7 11:30:56 2020
X-Patchwork-Submitter: yulei zhang
X-Patchwork-Id: 11955333
From: yulei.kernel@gmail.com
X-Google-Original-From: yuleixzhang@tencent.com
To: linux-mm@kvack.org, akpm@linux-foundation.org,
    linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org,
    linux-kernel@vger.kernel.org, naoya.horiguchi@nec.com,
    viro@zeniv.linux.org.uk, pbonzini@redhat.com
Cc: joao.m.martins@oracle.com, rdunlap@infradead.org,
    sean.j.christopherson@intel.com, xiaoguangrong.eric@gmail.com,
    kernellwp@gmail.com, lihaiwei.kernel@gmail.com,
    Yulei Zhang, Xiao Guangrong
Subject: [RFC V2 03/37] dmem: implement dmem memory management
Date: Mon, 7 Dec 2020 19:30:56 +0800
X-Mailing-List: linux-fsdevel@vger.kernel.org

From: Yulei Zhang

The figure below shows the topology of dmem memory management: it
reserves a few memory regions from the NUMA nodes, and in each region
it uses a bitmap to track the actual memory usage.

     +------+-+-------+---------+
     | Node0| |  ...  |  NodeN  |
     +--/-\-+-+-------+---------+
       /   \
   +---v----+v-----+----------+
   |region 0| ...  | region n |
   +--/---\-+------+----------+
     /      \
   +-+v+------v-------+-+-+-+
   | | |    bitmap    | | | |
   +-+-+--------------+-+-+-+

It introduces the following interfaces to manage dmem pages:
  - dmem_region_register(): registers the reserved memory to the
    dmem management system
  - dmem_alloc_init(): initializes the dmem allocator; note that the
    page size used by the allocator is not the same thing as the
    alignment used to reserve dmem memory
  - dmem_alloc_pages_vma() and dmem_free_pages(): the interfaces for
    allocating and freeing dmem memory; multiple pages can be
    allocated at one time, but the count should be a power of two

Signed-off-by: Xiao Guangrong
Signed-off-by: Yulei Zhang
---
 include/linux/dmem.h |   3 +
 mm/dmem.c            | 674 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 677 insertions(+)

diff --git a/include/linux/dmem.h b/include/linux/dmem.h
index 5049322..476a82e 100644
--- a/include/linux/dmem.h
+++ b/include/linux/dmem.h
@@ -7,6 +7,9 @@
 void dmem_init(void);
 int dmem_region_register(int node, phys_addr_t start, phys_addr_t end);
 
+int dmem_alloc_init(unsigned long dpage_shift);
+void dmem_alloc_uinit(void);
+
 #else
 static inline int dmem_reserve_init(void)
 {
diff --git a/mm/dmem.c b/mm/dmem.c
index b5fb4f1..a77a064 100644
--- a/mm/dmem.c
+++ b/mm/dmem.c
@@ -91,11 +91,38 @@ struct dmem_pool {
 	.lock = __MUTEX_INITIALIZER(dmem_pool.lock),
 };
 
+#define DMEM_PAGE_SIZE		(1UL << dmem_pool.dpage_shift)
+#define DMEM_PAGE_UP(x)		phys_to_dpage(((x) + DMEM_PAGE_SIZE - 1))
+#define DMEM_PAGE_DOWN(x)	phys_to_dpage(x)
+
+#define dpage_to_phys(_dpage)					\
+	((_dpage) << dmem_pool.dpage_shift)
+#define phys_to_dpage(_addr)					\
+	((_addr) >> dmem_pool.dpage_shift)
+
+#define dpage_to_pfn(_dpage)					\
+	(__phys_to_pfn(dpage_to_phys(_dpage)))
+#define pfn_to_dpage(_pfn)					\
+	(phys_to_dpage(__pfn_to_phys(_pfn)))
+
+#define dnode_to_nid(_dnode)					\
+	((_dnode) - dmem_pool.nodes)
+#define nid_to_dnode(nid)					\
+	(&dmem_pool.nodes[nid])
+
 #define for_each_dmem_node(_dnode)				\
 	for (_dnode = dmem_pool.nodes;				\
 	     _dnode < dmem_pool.nodes + ARRAY_SIZE(dmem_pool.nodes);	\
 	     _dnode++)
 
+#define for_each_dmem_region(_dnode, _dregion)			\
+	list_for_each_entry(_dregion, &(_dnode)->regions, node)
+
+static inline int *dmem_nodelist(int nid)
+{
+	return nid_to_dnode(nid)->nodelist;
+}
+
 void __init dmem_init(void)
 {
 	struct dmem_node *dnode;
@@ -135,3 +162,650 @@ int dmem_region_register(int node, phys_addr_t start, phys_addr_t end)
return 0; } +#define PENALTY_FOR_DMEM_SHARED_NODE (1) + +static int dmem_nodeload[MAX_NUMNODES] __initdata; + +/* Evaluate penalty for each dmem node */ +static int __init dmem_evaluate_node(int local, int node) +{ + int penalty; + + /* Use the distance array to find the distance */ + penalty = node_distance(local, node); + + /* Penalize nodes under us ("prefer the next node") */ + penalty += (node < local); + + /* Give preference to headless and unused nodes */ + if (!cpumask_empty(cpumask_of_node(node))) + penalty += PENALTY_FOR_NODE_WITH_CPUS; + + /* Penalize dmem-node shared with kernel */ + if (node_state(node, N_MEMORY)) + penalty += PENALTY_FOR_DMEM_SHARED_NODE; + + /* Slight preference for less loaded node */ + penalty *= (nr_online_nodes * MAX_NUMNODES); + + penalty += dmem_nodeload[node]; + + return penalty; +} + +static int __init find_next_dmem_node(int local, nodemask_t *used_nodes) +{ + struct dmem_node *dnode; + int node, best_node = NUMA_NO_NODE; + int penalty, min_penalty = INT_MAX; + + /* Invalid node is not suitable to call node_distance */ + if (!node_state(local, N_POSSIBLE)) + return NUMA_NO_NODE; + + /* Use the local node if we haven't already */ + if (!node_isset(local, *used_nodes)) { + node_set(local, *used_nodes); + return local; + } + + for_each_dmem_node(dnode) { + if (list_empty(&dnode->regions)) + continue; + + node = dnode_to_nid(dnode); + + /* Don't want a node to appear more than once */ + if (node_isset(node, *used_nodes)) + continue; + + penalty = dmem_evaluate_node(local, node); + + if (penalty < min_penalty) { + min_penalty = penalty; + best_node = node; + } + } + + if (best_node >= 0) + node_set(best_node, *used_nodes); + + return best_node; +} + +static int __init dmem_node_init(struct dmem_node *dnode) +{ + int *nodelist; + nodemask_t used_nodes; + int local, node, prev; + int load; + int i = 0; + + nodelist = dnode->nodelist; + nodes_clear(used_nodes); + local = dnode_to_nid(dnode); + prev = local; + load = nr_online_nodes; + + while ((node = find_next_dmem_node(local, &used_nodes)) >= 0) { + /* + * We don't want to pressure a particular node. + * So adding penalty to the first node in same + * distance group to make it round-robin. 
+ */ + if (node_distance(local, node) != node_distance(local, prev)) + dmem_nodeload[node] = load; + + nodelist[i++] = prev = node; + load--; + } + + return 0; +} + +static void __init dmem_region_uinit(struct dmem_region *dregion) +{ + unsigned long nr_pages, size, *bitmap = dregion->error_bitmap; + + if (!bitmap) + return; + + nr_pages = __phys_to_pfn(dregion->reserved_end_addr) + - __phys_to_pfn(dregion->reserved_start_addr); + + WARN_ON(!nr_pages); + + size = BITS_TO_LONGS(nr_pages) * sizeof(long); + if (size > sizeof(dregion->static_bitmap)) + kfree(bitmap); + dregion->error_bitmap = NULL; +} + +/* + * we only stop allocator to use the reserved page and do not + * reture pages back if anything goes wrong + */ +static void __init dmem_uinit(void) +{ + struct dmem_region *dregion, *dr; + struct dmem_node *dnode; + + for_each_dmem_node(dnode) { + dnode->nodelist[0] = NUMA_NO_NODE; + list_for_each_entry_safe(dregion, dr, &dnode->regions, node) { + dmem_region_uinit(dregion); + dregion->reserved_start_addr = + dregion->reserved_end_addr = 0; + list_del(&dregion->node); + } + } + + dmem_pool.region_num = 0; + dmem_pool.registered_pages = 0; +} + +static int __init dmem_region_init(struct dmem_region *dregion) +{ + unsigned long *bitmap, size, nr_pages; + + nr_pages = __phys_to_pfn(dregion->reserved_end_addr) + - __phys_to_pfn(dregion->reserved_start_addr); + + size = BITS_TO_LONGS(nr_pages) * sizeof(long); + if (size <= sizeof(dregion->static_error_bitmap)) { + bitmap = &dregion->static_error_bitmap; + } else { + bitmap = kzalloc(size, GFP_KERNEL); + if (!bitmap) + return -ENOMEM; + } + dregion->error_bitmap = bitmap; + return 0; +} + +/* + * dmem memory is not 'struct page' backend, i.e, the kernel threats + * it as invalid pfn + */ +static int __init dmem_check_region(struct dmem_region *dregion) +{ + unsigned long pfn; + + for (pfn = __phys_to_pfn(dregion->reserved_start_addr); + pfn < __phys_to_pfn(dregion->reserved_end_addr); pfn++) { + if (!WARN_ON(pfn_valid(pfn))) + continue; + + pr_err("dmem: check pfn %#lx failed, its memory was not properly reserved\n", + pfn); + return -EINVAL; + } + + return 0; +} + +static int __init dmem_late_init(void) +{ + struct dmem_region *dregion; + struct dmem_node *dnode; + int ret; + + for_each_dmem_node(dnode) { + dmem_node_init(dnode); + + for_each_dmem_region(dnode, dregion) { + ret = dmem_region_init(dregion); + if (ret) + goto exit; + ret = dmem_check_region(dregion); + if (ret) + goto exit; + } + } + return ret; +exit: + dmem_uinit(); + return ret; +} +late_initcall(dmem_late_init); + +static int dmem_alloc_region_init(struct dmem_region *dregion, + unsigned long *dpages) +{ + unsigned long start, end, *bitmap, size; + + start = DMEM_PAGE_UP(dregion->reserved_start_addr); + end = DMEM_PAGE_DOWN(dregion->reserved_end_addr); + + *dpages = end - start; + if (!*dpages) + return 0; + + size = BITS_TO_LONGS(*dpages) * sizeof(long); + if (size <= sizeof(dregion->static_bitmap)) + bitmap = &dregion->static_bitmap; + else { + bitmap = kzalloc(size, GFP_KERNEL); + if (!bitmap) + return -ENOMEM; + } + + dregion->bitmap = bitmap; + dregion->next_free_pos = 0; + dregion->dpage_start_pfn = start; + dregion->dpage_end_pfn = end; + + dmem_pool.unaligned_pages += __phys_to_pfn((dpage_to_phys(start) + - dregion->reserved_start_addr)); + dmem_pool.unaligned_pages += __phys_to_pfn(dregion->reserved_end_addr + - dpage_to_phys(end)); + return 0; +} + +static bool dmem_dpage_is_error(struct dmem_region *dregion, phys_addr_t dpage) +{ + unsigned long valid_pages; + 
unsigned long pos_pfn, pos_offset; + unsigned long pages_per_dpage = DMEM_PAGE_SIZE >> PAGE_SHIFT; + phys_addr_t reserved_start_pfn; + + reserved_start_pfn = __phys_to_pfn(dregion->reserved_start_addr); + valid_pages = dpage_to_pfn(dregion->dpage_end_pfn) - reserved_start_pfn; + + pos_offset = dpage_to_pfn(dpage) - reserved_start_pfn; + pos_pfn = find_next_bit(dregion->error_bitmap, valid_pages, pos_offset); + if (pos_pfn < pos_offset + pages_per_dpage) + return true; + return false; +} + +static unsigned long +dmem_alloc_bitmap_clear(struct dmem_region *dregion, phys_addr_t dpage, + unsigned int dpages_nr) +{ + u64 pos = dpage - dregion->dpage_start_pfn; + unsigned int i; + unsigned long err_num = 0; + + for (i = 0; i < dpages_nr; i++) { + if (dmem_dpage_is_error(dregion, dpage + i)) { + WARN_ON(!test_bit(pos + i, dregion->bitmap)); + err_num++; + } else { + WARN_ON(!__test_and_clear_bit(pos + i, + dregion->bitmap)); + } + } + return err_num; +} + +/* set or clear corresponding bit on allocation bitmap based on error bitmap */ +static unsigned long dregion_alloc_bitmap_set_clear(struct dmem_region *dregion, + bool set) +{ + unsigned long pos_pfn, pos_offset; + unsigned long valid_pages, mce_dpages = 0; + phys_addr_t dpage, reserved_start_pfn; + + reserved_start_pfn = __phys_to_pfn(dregion->reserved_start_addr); + + valid_pages = dpage_to_pfn(dregion->dpage_end_pfn) - reserved_start_pfn; + pos_offset = dpage_to_pfn(dregion->dpage_start_pfn) + - reserved_start_pfn; +try_set: + pos_pfn = find_next_bit(dregion->error_bitmap, valid_pages, pos_offset); + + if (pos_pfn >= valid_pages) + return mce_dpages; + mce_dpages++; + dpage = pfn_to_dpage(pos_pfn + reserved_start_pfn); + if (set) + WARN_ON(__test_and_set_bit(dpage - dregion->dpage_start_pfn, + dregion->bitmap)); + else + WARN_ON(!__test_and_clear_bit(dpage - dregion->dpage_start_pfn, + dregion->bitmap)); + pos_offset = dpage_to_pfn(dpage + 1) - reserved_start_pfn; + goto try_set; +} + +static void dmem_uinit_check_alloc_bitmap(struct dmem_region *dregion) +{ + unsigned long dpages, size; + + dregion_alloc_bitmap_set_clear(dregion, false); + + dpages = dregion->dpage_end_pfn - dregion->dpage_start_pfn; + size = BITS_TO_LONGS(dpages) * sizeof(long); + WARN_ON(!bitmap_empty(dregion->bitmap, size * BITS_PER_BYTE)); +} + +static void dmem_alloc_region_uinit(struct dmem_region *dregion) +{ + unsigned long dpages, size, *bitmap = dregion->bitmap; + + if (!bitmap) + return; + + dpages = dregion->dpage_end_pfn - dregion->dpage_start_pfn; + WARN_ON(!dpages); + + dmem_uinit_check_alloc_bitmap(dregion); + + size = BITS_TO_LONGS(dpages) * sizeof(long); + if (size > sizeof(dregion->static_bitmap)) + kfree(bitmap); + dregion->bitmap = NULL; +} + +static void __dmem_alloc_uinit(void) +{ + struct dmem_node *dnode; + struct dmem_region *dregion; + + if (!dmem_pool.dpage_shift) + return; + + dmem_pool.unaligned_pages = 0; + + for_each_dmem_node(dnode) { + for_each_dmem_region(dnode, dregion) + dmem_alloc_region_uinit(dregion); + + dnode->total_dpages = dnode->free_dpages = 0; + } + + dmem_pool.dpage_shift = 0; + dmem_pool.total_dpages = dmem_pool.free_dpages = 0; +} + +static void dnode_count_free_dpages(struct dmem_node *dnode, long dpages) +{ + dnode->free_dpages += dpages; + dmem_pool.free_dpages += dpages; +} + +/* + * uninitialize dmem allocator + * + * all dpages should be freed before calling it + */ +void dmem_alloc_uinit(void) +{ + mutex_lock(&dmem_pool.lock); + if (!--dmem_pool.user_count) + __dmem_alloc_uinit(); + mutex_unlock(&dmem_pool.lock); +} 
+EXPORT_SYMBOL(dmem_alloc_uinit); + +/* + * initialize dmem allocator + * @dpage_shift: the shift bits of dmem page size used to manange + * dmem memory, it should be CPU's nature page size at least + * + * Note: the page size the allocator used isn't the same thing with + * the alignment used to reserve dmem memory + */ +int dmem_alloc_init(unsigned long dpage_shift) +{ + struct dmem_node *dnode; + struct dmem_region *dregion; + unsigned long dpages; + int ret = 0; + + if (dpage_shift < PAGE_SHIFT) + return -EINVAL; + + mutex_lock(&dmem_pool.lock); + + if (dmem_pool.dpage_shift) { + /* + * double init on the same page size is okay + * to make the unit tests happy + */ + if (dmem_pool.dpage_shift != dpage_shift) + ret = -EBUSY; + + goto exit; + } + + dmem_pool.dpage_shift = dpage_shift; + + for_each_dmem_node(dnode) { + for_each_dmem_region(dnode, dregion) { + ret = dmem_alloc_region_init(dregion, &dpages); + if (ret < 0) { + __dmem_alloc_uinit(); + goto exit; + } + + dnode_count_free_dpages(dnode, dpages); + } + dnode->total_dpages = dnode->free_dpages; + } + + dmem_pool.total_dpages = dmem_pool.free_dpages; + + if (dmem_pool.unaligned_pages && !ret) + pr_warn("dmem: %llu pages are wasted due to alignment\n", + (unsigned long long)dmem_pool.unaligned_pages); +exit: + if (!ret) + dmem_pool.user_count++; + + mutex_unlock(&dmem_pool.lock); + return ret; +} +EXPORT_SYMBOL(dmem_alloc_init); + +static phys_addr_t +dmem_alloc_region_page(struct dmem_region *dregion, unsigned int try_max, + unsigned int *result_nr) +{ + unsigned long pos, dpages; + unsigned int i; + + /* no dpage is available in this region */ + if (!dregion->bitmap) + return 0; + + dpages = dregion->dpage_end_pfn - dregion->dpage_start_pfn; + + /* no free page in this region */ + if (dregion->next_free_pos >= dpages) + return 0; + + pos = find_next_zero_bit(dregion->bitmap, dpages, + dregion->next_free_pos); + if (pos >= dpages) { + dregion->next_free_pos = pos; + return 0; + } + + __set_bit(pos, dregion->bitmap); + + /* do not go beyond the region */ + try_max = min(try_max, (unsigned int)(dpages - pos - 1)); + for (i = 1; i < try_max; i++) + if (__test_and_set_bit(pos + i, dregion->bitmap)) + break; + + *result_nr = i; + dregion->next_free_pos = pos + *result_nr; + return dpage_to_phys(dregion->dpage_start_pfn + pos); +} + +/* + * allocate dmem pages from the nodelist + * + * @nodelist: dmem_node's nodelist + * @nodemask: nodemask for filtering the dmem nodelist + * @try_max: try to allocate @try_max dpages if possible + * @result_nr: allocated dpage number returned to the caller + * + * return the physical address of the first dpage allocated from dmem + * pool, or 0 on failure. 
The allocated dpage number is filled into + * @result_nr + */ +static phys_addr_t +dmem_alloc_pages_from_nodelist(int *nodelist, nodemask_t *nodemask, + unsigned int try_max, unsigned int *result_nr) +{ + struct dmem_node *dnode; + struct dmem_region *dregion; + phys_addr_t addr = 0; + int node, i; + unsigned int local_result_nr; + + WARN_ON(try_max > 1 && !result_nr); + + if (!result_nr) + result_nr = &local_result_nr; + + *result_nr = 0; + + for (i = 0; !addr && i < ARRAY_SIZE(dnode->nodelist); i++) { + node = nodelist[i]; + + if (nodemask && !node_isset(node, *nodemask)) + continue; + + mutex_lock(&dmem_pool.lock); + + WARN_ON(!dmem_pool.dpage_shift); + + dnode = &dmem_pool.nodes[node]; + for_each_dmem_region(dnode, dregion) { + addr = dmem_alloc_region_page(dregion, try_max, + result_nr); + if (addr) { + dnode_count_free_dpages(dnode, + -(long)(*result_nr)); + break; + } + } + + mutex_unlock(&dmem_pool.lock); + } + return addr; +} + +/* + * allocate a dmem page from the dmem pool and try to allocate more + * continuous dpages if @try_max is not less than 1 + * + * @nid: the NUMA node the dmem page got from + * @nodemask: nodemask for filtering the dmem nodelist + * @try_max: try to allocate @try_max dpages if possible + * @result_nr: allocated dpage number returned to the caller + * + * return the physical address of the first dpage allocated from dmem + * pool, or 0 on failure. The allocated dpage number is filled into + * @result_nr + */ +phys_addr_t +dmem_alloc_pages_nodemask(int nid, nodemask_t *nodemask, unsigned int try_max, + unsigned int *result_nr) +{ + int *nodelist; + + if (nid >= sizeof(ARRAY_SIZE(dmem_pool.nodes))) + return 0; + + nodelist = dmem_nodelist(nid); + return dmem_alloc_pages_from_nodelist(nodelist, nodemask, + try_max, result_nr); +} +EXPORT_SYMBOL(dmem_alloc_pages_nodemask); + +/* + * dmem_alloc_pages_vma - Allocate pages for a VMA. + * + * @vma: Pointer to VMA or NULL if not available. + * @addr: Virtual Address of the allocation. Must be inside the VMA. + * @try_max: try to allocate @try_max dpages if possible + * @result_nr: allocated dpage number returned to the caller + * + * Return the physical address of the first dpage allocated from dmem + * pool, or 0 on failure. The allocated dpage number is filled into + * @result_nr + */ +phys_addr_t +dmem_alloc_pages_vma(struct vm_area_struct *vma, unsigned long addr, + unsigned int try_max, unsigned int *result_nr) +{ + phys_addr_t phys_addr; + int *nl; + unsigned int cpuset_mems_cookie; + +retry_cpuset: + nl = dmem_nodelist(numa_node_id()); + + phys_addr = dmem_alloc_pages_from_nodelist(nl, NULL, try_max, + result_nr); + if (unlikely(!phys_addr && read_mems_allowed_retry(cpuset_mems_cookie))) + goto retry_cpuset; + + return phys_addr; +} +EXPORT_SYMBOL(dmem_alloc_pages_vma); + +/* + * Don't need to call it in a lock. + * This function uses the reserved addresses those are initially registered + * and will not be modified at run time. 
+ */
+static struct dmem_region *find_dmem_region(phys_addr_t phys_addr,
+					    struct dmem_node **pdnode)
+{
+	struct dmem_node *dnode;
+	struct dmem_region *dregion;
+
+	for_each_dmem_node(dnode)
+		for_each_dmem_region(dnode, dregion) {
+			if (dregion->reserved_start_addr > phys_addr)
+				continue;
+			if (dregion->reserved_end_addr <= phys_addr)
+				continue;
+
+			*pdnode = dnode;
+			return dregion;
+		}
+
+	return NULL;
+}
+
+/*
+ * free dmem pages back to the dmem pool
+ * @addr: the physical address to be freed
+ * @dpages_nr: the number of dpages to be freed
+ */
+void dmem_free_pages(phys_addr_t addr, unsigned int dpages_nr)
+{
+	struct dmem_region *dregion;
+	struct dmem_node *pdnode = NULL;
+	phys_addr_t dpage = phys_to_dpage(addr);
+	u64 pos;
+	unsigned long err_dpages;
+
+	mutex_lock(&dmem_pool.lock);
+
+	WARN_ON(!dmem_pool.dpage_shift);
+
+	dregion = find_dmem_region(addr, &pdnode);
+	WARN_ON(!dregion || !dregion->bitmap || !pdnode);
+
+	pos = dpage - dregion->dpage_start_pfn;
+	dregion->next_free_pos = min(dregion->next_free_pos, pos);
+
+	/* it is not possible to span multiple regions */
+	WARN_ON(dpage + dpages_nr - 1 >= dregion->dpage_end_pfn);
+
+	err_dpages = dmem_alloc_bitmap_clear(dregion, dpage, dpages_nr);
+
+	dnode_count_free_dpages(pdnode, dpages_nr - err_dpages);
+	mutex_unlock(&dmem_pool.lock);
+}
+EXPORT_SYMBOL(dmem_free_pages);
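To make the calling convention of the new allocator concrete, here is a hedged
sketch of a possible in-kernel consumer; the function is hypothetical, and it
assumes the prototypes of dmem_alloc_pages_nodemask() and dmem_free_pages()
are exposed to other modules by a later patch in this series:

	#include <linux/dmem.h>

	static int example_dmem_user(void)
	{
		phys_addr_t addr;
		unsigned int got = 0;
		int ret;

		/* manage dmem in 2M units; the shift must be at least PAGE_SHIFT */
		ret = dmem_alloc_init(PMD_SHIFT);
		if (ret)
			return ret;

		/* try to allocate up to 4 contiguous dpages, preferring node 0 */
		addr = dmem_alloc_pages_nodemask(0, NULL, 4, &got);
		if (!addr) {
			dmem_alloc_uinit();
			return -ENOMEM;
		}

		/*
		 * 'addr' is a physical address with no 'struct page'; a real
		 * user would map it with pfn-based helpers before use.
		 */

		dmem_free_pages(addr, got);
		dmem_alloc_uinit();
		return 0;
	}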
QsCC+kgZIWtG6Zl3CQdJd/Yz+V47c7WU3yfJrs9g+0W1tVoMSSFSOz4Y4OTMBNn32GVd N0JQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=V7IqDWSKXukQODvyRYxxHyAp9kzYnONBivG3E1uSTi0=; b=R5PzOt071SIFuEPzEBrsUqZOtJaaQXBoFiZfK/VsM9XB2Okhvr/mOr0wMtSzgUKHmY 8YRoqxInvGfOylOLw6Pk4fEdY1ZEU0Z8qk5iD/SR6vcQov9narVwc4xtqSMZ2MjOQDL5 IBZmdjxAmQxRxXi8+3ojTsbpRY7D55MLI5I8ES/Mb43ISRESKnqfTW7mP/ajMqrKEHGB /TlowNgQaK57OU3OinNFxDD1Jvl13JygJjbO8/OzhiUVukCgjzh8weA47JyTFU1+d9Gq CwnYGnG+nJ48X9JTvDKB04aFwe7s+sZsoSjbMRsIaqUM5HlTsrb3j6e5HkrlXC9JV8b7 eTpw== X-Gm-Message-State: AOAM530mNBx94VD9YO8dRkXdQiB4YeyFE8/wJqnVXxAzeEU279dBbFIU Gf2YN0TBpGZG2RBFEKZg/58= X-Google-Smtp-Source: ABdhPJyQefdtFNDvBzo/WKmc11oLIwPzRy//V+/lSFKCStAywDqYn6pKN5mcS5ZLbN4k0Rx9dBfgRA== X-Received: by 2002:a17:902:9341:b029:da:13f5:302a with SMTP id g1-20020a1709029341b02900da13f5302amr15670304plp.9.1607340817145; Mon, 07 Dec 2020 03:33:37 -0800 (PST) Received: from localhost.localdomain ([203.205.141.39]) by smtp.gmail.com with ESMTPSA id d4sm14219822pfo.127.2020.12.07.03.33.33 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Mon, 07 Dec 2020 03:33:36 -0800 (PST) From: yulei.kernel@gmail.com X-Google-Original-From: yuleixzhang@tencent.com To: linux-mm@kvack.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com Cc: joao.m.martins@oracle.com, rdunlap@infradead.org, sean.j.christopherson@intel.com, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang , Xiao Guangrong Subject: [RFC V2 04/37] dmem: let pat recognize dmem Date: Mon, 7 Dec 2020 19:30:57 +0800 Message-Id: <805999e57d629348f813017e02a086e33e507d9e.1607332046.git.yuleixzhang@tencent.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org From: Yulei Zhang x86 pat uses 'struct page' by only checking if it's system ram, however it is not true if dmem is used, let's teach pat to recognize this case if it is ram but it is !pfn_valid() We always use WB for dmem and any attempt to change this behavior will be rejected and WARN_ON is triggered Signed-off-by: Xiao Guangrong Signed-off-by: Yulei Zhang --- arch/x86/mm/pat/memtype.c | 21 +++++++++++++++++++++ 1 file changed, 21 insertions(+) diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c index 8f665c3..fd8a298 100644 --- a/arch/x86/mm/pat/memtype.c +++ b/arch/x86/mm/pat/memtype.c @@ -511,6 +511,13 @@ static int reserve_ram_pages_type(u64 start, u64 end, for (pfn = (start >> PAGE_SHIFT); pfn < (end >> PAGE_SHIFT); ++pfn) { enum page_cache_mode type; + /* + * it's dmem if it's ram but not 'struct page' backend, + * we always use WB + */ + if (WARN_ON(!pfn_valid(pfn))) + return -EBUSY; + page = pfn_to_page(pfn); type = get_page_memtype(page); if (type != _PAGE_CACHE_MODE_WB) { @@ -539,6 +546,13 @@ static int free_ram_pages_type(u64 start, u64 end) u64 pfn; for (pfn = (start >> PAGE_SHIFT); pfn < (end >> PAGE_SHIFT); ++pfn) { + /* + * it's dmem, see the comments in + * reserve_ram_pages_type() + */ + if (WARN_ON(!pfn_valid(pfn))) + continue; + page = pfn_to_page(pfn); set_page_memtype(page, _PAGE_CACHE_MODE_WB); } @@ -714,6 +728,13 @@ static enum page_cache_mode lookup_memtype(u64 paddr) if 
(pat_pagerange_is_ram(paddr, paddr + PAGE_SIZE)) { struct page *page; + /* + * dmem always uses WB, see the comments in + * reserve_ram_pages_type() + */ + if (!pfn_valid(paddr >> PAGE_SHIFT)) + return rettype; + page = pfn_to_page(paddr >> PAGE_SHIFT); return get_page_memtype(page); } From patchwork Mon Dec 7 11:30:58 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11955337 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-15.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id C43D1C2BBCA for ; Mon, 7 Dec 2020 11:34:28 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 9EA16233CF for ; Mon, 7 Dec 2020 11:34:28 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726928AbgLGLeX (ORCPT ); Mon, 7 Dec 2020 06:34:23 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40798 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726893AbgLGLeW (ORCPT ); Mon, 7 Dec 2020 06:34:22 -0500 Received: from mail-pf1-x444.google.com (mail-pf1-x444.google.com [IPv6:2607:f8b0:4864:20::444]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id D3DE4C0613D4; Mon, 7 Dec 2020 03:33:41 -0800 (PST) Received: by mail-pf1-x444.google.com with SMTP id c79so9613551pfc.2; Mon, 07 Dec 2020 03:33:41 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=/44otMrMxzCUiGk0V1t2L81SCVx385WHUCyFll3f9p0=; b=EiialdWRB3FHs1vDA+YkzfnQD6VJz6N4qmPV73L07sHuG/P4ErRRe6bm61f9g5hr9H +bV+1omLT2lXgkipsrER7t4rQ/QroKwlhp9ASMHS41/YxSwuudP0PtPd/urhOkA/EO9+ TiWzoGcwtXpte4GC/vTXZi503fNXbWeHFok7IbUc5PmkUX4gvlyg1b9Q+jbOSz3ZunLJ gdl947A+Eh0NIoZr03LVyIBhMu8U/PPJJg7YRgXc0swqDOVuRoWZ1LhJGdtkIFQaMPvG Dygv6S2XVQrEGFMtvhF+fVgdh1BjHZ9kopYhtI9kkXo7XYMujVSmvjq+i89HNmdDCg1L X81A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=/44otMrMxzCUiGk0V1t2L81SCVx385WHUCyFll3f9p0=; b=Vkglmiu6dKagSE4A+La5alVrp6wuSwVV8girSmYUpLGa5Fty8XZZwxdt8kx/bY+Xw/ NSVJshkZXQrh0FyY19edMxrYw+L+30A5SBZQnAu8aBV0J7ubJmzy32lB8IJZPa4Xwzq+ /eGyj2okiBEKcONc/ruOOseTLrgOZCT/Srgef3zGAsAf5ojF6XpW2Yb594nQl+Es9Pma xlmYmarAvmnCaqOB9xpfRbyFjYUWI/PMhdQ9Pad17A/ffDecOiZ8OJacZcKImQ1We+ld bona/H8uEtoBRLx1Qwf3yK6GQ40YwkIt9NPdX0VLoGQZugfvuyeyfh9kd5iY5oitTcCu 5gXw== X-Gm-Message-State: AOAM530yJxuJJIfGc5NLhlF51XCgoTyJDpcSl+2imwmzBqIv6/V/WvKG 6syVMST3M881IB/uAC4Eb0A= X-Google-Smtp-Source: ABdhPJxsuCuuWYmFuc/z0htdn8lG7NF7yPaaVzD61TY34e7sLxSOK8H64w92t88zlRNSfCJLilKEiQ== X-Received: by 2002:aa7:8b15:0:b029:196:59ad:ab93 with SMTP id f21-20020aa78b150000b029019659adab93mr15263676pfd.16.1607340821419; Mon, 07 Dec 2020 03:33:41 -0800 (PST) Received: from localhost.localdomain ([203.205.141.39]) by 
smtp.gmail.com with ESMTPSA id d4sm14219822pfo.127.2020.12.07.03.33.38 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Mon, 07 Dec 2020 03:33:40 -0800 (PST) From: yulei.kernel@gmail.com X-Google-Original-From: yuleixzhang@tencent.com To: linux-mm@kvack.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com Cc: joao.m.martins@oracle.com, rdunlap@infradead.org, sean.j.christopherson@intel.com, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang , Xiao Guangrong Subject: [RFC V2 05/37] dmemfs: support mmap for dmemfs Date: Mon, 7 Dec 2020 19:30:58 +0800 Message-Id: <556903717e3d0b0fc0b9583b709f4b34be2154cb.1607332046.git.yuleixzhang@tencent.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org From: Yulei Zhang This patch adds mmap support. Note the file will be extended if it's beyond mmap's offset, that drops the requirement of write() operation, however, it has not supported cutting file down yet. Signed-off-by: Xiao Guangrong Signed-off-by: Yulei Zhang --- fs/dmemfs/inode.c | 343 ++++++++++++++++++++++++++++++++++++++++++++++++++- include/linux/dmem.h | 10 ++ 2 files changed, 351 insertions(+), 2 deletions(-) diff --git a/fs/dmemfs/inode.c b/fs/dmemfs/inode.c index 0aa3d3b..7b6e51d 100644 --- a/fs/dmemfs/inode.c +++ b/fs/dmemfs/inode.c @@ -26,6 +26,7 @@ #include #include #include +#include MODULE_AUTHOR("Tencent Corporation"); MODULE_LICENSE("GPL v2"); @@ -102,7 +103,255 @@ static int dmemfs_mkdir(struct inode *dir, struct dentry *dentry, .getattr = simple_getattr, }; +static unsigned long dmem_pgoff_to_index(struct inode *inode, pgoff_t pgoff) +{ + struct super_block *sb = inode->i_sb; + + return pgoff >> (sb->s_blocksize_bits - PAGE_SHIFT); +} + +static void *dmem_addr_to_entry(struct inode *inode, phys_addr_t addr) +{ + struct super_block *sb = inode->i_sb; + + addr >>= sb->s_blocksize_bits; + return xa_mk_value(addr); +} + +static phys_addr_t dmem_entry_to_addr(struct inode *inode, void *entry) +{ + struct super_block *sb = inode->i_sb; + + WARN_ON(!xa_is_value(entry)); + return xa_to_value(entry) << sb->s_blocksize_bits; +} + +static unsigned long +dmem_addr_to_pfn(struct inode *inode, phys_addr_t addr, pgoff_t pgoff, + unsigned int fault_shift) +{ + struct super_block *sb = inode->i_sb; + unsigned long pfn = addr >> PAGE_SHIFT; + unsigned long mask; + + mask = (1UL << ((unsigned int)sb->s_blocksize_bits - fault_shift)) - 1; + mask <<= fault_shift - PAGE_SHIFT; + + return pfn + (pgoff & mask); +} + +static inline unsigned long dmem_page_size(struct inode *inode) +{ + return inode->i_sb->s_blocksize; +} + +static int check_inode_size(struct inode *inode, loff_t offset) +{ + WARN_ON_ONCE(!rcu_read_lock_held()); + + if (offset >= i_size_read(inode)) + return -EINVAL; + + return 0; +} + +static unsigned +dmemfs_find_get_entries(struct address_space *mapping, unsigned long start, + unsigned int nr_entries, void **entries, + unsigned long *indices) +{ + XA_STATE(xas, &mapping->i_pages, start); + + void *entry; + unsigned int ret = 0; + + if (!nr_entries) + return 0; + + rcu_read_lock(); + + xas_for_each(&xas, entry, ULONG_MAX) { + if (xas_retry(&xas, entry)) + continue; + + if (xa_is_value(entry)) + goto export; + + if (unlikely(entry != xas_reload(&xas))) + goto retry; + +export: + indices[ret] = 
xas.xa_index; + entries[ret] = entry; + if (++ret == nr_entries) + break; + continue; +retry: + xas_reset(&xas); + } + rcu_read_unlock(); + return ret; +} + +static void *find_radix_entry_or_next(struct address_space *mapping, + unsigned long start, + unsigned long *eindex) +{ + void *entry = NULL; + + dmemfs_find_get_entries(mapping, start, 1, &entry, eindex); + return entry; +} + +/* + * find the entry in radix tree based on @index, create it if + * it does not exist + * + * return the entry with rcu locked, otherwise ERR_PTR() + * is returned + */ +static void * +radix_get_create_entry(struct vm_area_struct *vma, unsigned long fault_addr, + struct inode *inode, pgoff_t pgoff) +{ + struct address_space *mapping = inode->i_mapping; + unsigned long eindex, index; + loff_t offset; + phys_addr_t addr; + gfp_t gfp_masks = mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM; + void *entry; + unsigned int try_dpages, dpages; + int ret; + +retry: + offset = ((loff_t)pgoff << PAGE_SHIFT); + index = dmem_pgoff_to_index(inode, pgoff); + rcu_read_lock(); + ret = check_inode_size(inode, offset); + if (ret) { + rcu_read_unlock(); + return ERR_PTR(ret); + } + + try_dpages = dmem_pgoff_to_index(inode, (i_size_read(inode) - offset) + >> PAGE_SHIFT); + entry = find_radix_entry_or_next(mapping, index, &eindex); + if (entry) { + WARN_ON(!xa_is_value(entry)); + if (eindex == index) + return entry; + + WARN_ON(eindex <= index); + try_dpages = eindex - index; + } + rcu_read_unlock(); + + /* entry does not exist, create it */ + addr = dmem_alloc_pages_vma(vma, fault_addr, try_dpages, &dpages); + if (!addr) { + /* + * do not return -ENOMEM as that will trigger OOM, + * it is useless for reclaiming dmem page + */ + ret = -EINVAL; + goto exit; + } + + try_dpages = dpages; + while (dpages) { + rcu_read_lock(); + ret = check_inode_size(inode, offset); + if (ret) + goto unlock_rcu; + + entry = dmem_addr_to_entry(inode, addr); + entry = xa_store(&mapping->i_pages, index, entry, gfp_masks); + if (!xa_is_err(entry)) { + addr += inode->i_sb->s_blocksize; + offset += inode->i_sb->s_blocksize; + dpages--; + mapping->nrexceptional++; + index++; + } + +unlock_rcu: + rcu_read_unlock(); + if (ret) + break; + } + + if (dpages) + dmem_free_pages(addr, dpages); + + /* we have created some entries, let's retry it */ + if (ret == -EEXIST || try_dpages != dpages) + goto retry; +exit: + return ERR_PTR(ret); +} + +static void radix_put_entry(void) +{ + rcu_read_unlock(); +} + +static vm_fault_t dmemfs_fault(struct vm_fault *vmf) +{ + struct vm_area_struct *vma = vmf->vma; + struct inode *inode = file_inode(vma->vm_file); + phys_addr_t addr; + void *entry; + int ret; + + if (vmf->pgoff > (MAX_LFS_FILESIZE >> PAGE_SHIFT)) + return VM_FAULT_SIGBUS; + + entry = radix_get_create_entry(vma, (unsigned long)vmf->address, + inode, vmf->pgoff); + if (IS_ERR(entry)) { + ret = PTR_ERR(entry); + goto exit; + } + + addr = dmem_entry_to_addr(inode, entry); + ret = vmf_insert_pfn(vma, (unsigned long)vmf->address, + dmem_addr_to_pfn(inode, addr, vmf->pgoff, + PAGE_SHIFT)); + radix_put_entry(); + +exit: + return ret; +} + +static unsigned long dmemfs_pagesize(struct vm_area_struct *vma) +{ + return dmem_page_size(file_inode(vma->vm_file)); +} + +static const struct vm_operations_struct dmemfs_vm_ops = { + .fault = dmemfs_fault, + .pagesize = dmemfs_pagesize, +}; + +int dmemfs_file_mmap(struct file *file, struct vm_area_struct *vma) +{ + struct inode *inode = file_inode(file); + + if (vma->vm_pgoff & ((dmem_page_size(inode) - 1) >> PAGE_SHIFT)) + return -EINVAL; 
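+ /*
+ * dmem pages are not 'struct page' backed, so private copy-on-write
+ * mappings cannot be supported; only shared mappings are accepted below.
+ */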
+ + if (!(vma->vm_flags & VM_SHARED)) + return -EINVAL; + + vma->vm_flags |= VM_PFNMAP; + + file_accessed(file); + vma->vm_ops = &dmemfs_vm_ops; + return 0; +} + static const struct file_operations dmemfs_file_operations = { + .mmap = dmemfs_file_mmap, }; static int dmemfs_parse_param(struct fs_context *fc, struct fs_parameter *param) @@ -180,9 +429,86 @@ static int dmemfs_statfs(struct dentry *dentry, struct kstatfs *buf) return 0; } +/* + * should make sure the dmem page in the dropped region is not + * being mapped by any process + */ +static void inode_drop_dpages(struct inode *inode, loff_t start, loff_t end) +{ + struct address_space *mapping = inode->i_mapping; + struct pagevec pvec; + unsigned long istart, iend, indices[PAGEVEC_SIZE]; + int i; + + /* we never use normap page */ + WARN_ON(mapping->nrpages); + + /* if no dpage is allocated for the inode */ + if (!mapping->nrexceptional) + return; + + istart = dmem_pgoff_to_index(inode, start >> PAGE_SHIFT); + iend = dmem_pgoff_to_index(inode, end >> PAGE_SHIFT); + pagevec_init(&pvec); + while (istart < iend) { + pvec.nr = dmemfs_find_get_entries(mapping, istart, + min(iend - istart, + (unsigned long)PAGEVEC_SIZE), + (void **)pvec.pages, + indices); + if (!pvec.nr) + break; + + for (i = 0; i < pagevec_count(&pvec); i++) { + phys_addr_t addr; + + istart = indices[i]; + if (istart >= iend) + break; + + xa_erase(&mapping->i_pages, istart); + mapping->nrexceptional--; + + addr = dmem_entry_to_addr(inode, pvec.pages[i]); + dmem_free_page(addr); + } + + /* + * only exception entries in pagevec, it's safe to + * reinit it + */ + pagevec_reinit(&pvec); + cond_resched(); + istart++; + } +} + +static void dmemfs_evict_inode(struct inode *inode) +{ + /* no VMA works on it */ + WARN_ON(!RB_EMPTY_ROOT(&inode->i_data.i_mmap.rb_root)); + + inode_drop_dpages(inode, 0, LLONG_MAX); + clear_inode(inode); +} + +/* + * Display the mount options in /proc/mounts. 
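+ * The dpage size is reported as a hex byte count, e.g. "pagesize=200000"
+ * for 2MB dpages.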
+ */ +static int dmemfs_show_options(struct seq_file *m, struct dentry *root) +{ + struct dmemfs_fs_info *fsi = root->d_sb->s_fs_info; + + if (check_dpage_size(fsi->mount_opts.dpage_size)) + seq_printf(m, ",pagesize=%lx", fsi->mount_opts.dpage_size); + return 0; +} + static const struct super_operations dmemfs_ops = { .statfs = dmemfs_statfs, + .evict_inode = dmemfs_evict_inode, .drop_inode = generic_delete_inode, + .show_options = dmemfs_show_options, }; static int @@ -190,6 +516,7 @@ static int dmemfs_statfs(struct dentry *dentry, struct kstatfs *buf) { struct inode *inode; struct dmemfs_fs_info *fsi = sb->s_fs_info; + int ret; sb->s_maxbytes = MAX_LFS_FILESIZE; sb->s_blocksize = fsi->mount_opts.dpage_size; @@ -198,11 +525,17 @@ static int dmemfs_statfs(struct dentry *dentry, struct kstatfs *buf) sb->s_op = &dmemfs_ops; sb->s_time_gran = 1; + ret = dmem_alloc_init(sb->s_blocksize_bits); + if (ret) + return ret; + inode = dmemfs_get_inode(sb, NULL, S_IFDIR); sb->s_root = d_make_root(inode); - if (!sb->s_root) - return -ENOMEM; + if (!sb->s_root) { + dmem_alloc_uinit(); + return -ENOMEM; + } return 0; } @@ -238,7 +571,13 @@ int dmemfs_init_fs_context(struct fs_context *fc) static void dmemfs_kill_sb(struct super_block *sb) { + bool has_inode = !!sb->s_root; + kill_litter_super(sb); + + /* do not uninit dmem allocator if mount failed */ + if (has_inode) + dmem_alloc_uinit(); } static struct file_system_type dmemfs_fs_type = { diff --git a/include/linux/dmem.h b/include/linux/dmem.h index 476a82e..8682d63 100644 --- a/include/linux/dmem.h +++ b/include/linux/dmem.h @@ -10,6 +10,16 @@ int dmem_alloc_init(unsigned long dpage_shift); void dmem_alloc_uinit(void); +phys_addr_t +dmem_alloc_pages_nodemask(int nid, nodemask_t *nodemask, unsigned int try_max, + unsigned int *result_nr); + +phys_addr_t +dmem_alloc_pages_vma(struct vm_area_struct *vma, unsigned long addr, + unsigned int try_max, unsigned int *result_nr); + +void dmem_free_pages(phys_addr_t addr, unsigned int dpages_nr); +#define dmem_free_page(addr) dmem_free_pages(addr, 1) #else static inline int dmem_reserve_init(void) { From patchwork Mon Dec 7 11:30:59 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11955339 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-15.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id B0650C4167B for ; Mon, 7 Dec 2020 11:35:10 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 67824233A1 for ; Mon, 7 Dec 2020 11:35:10 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726980AbgLGLee (ORCPT ); Mon, 7 Dec 2020 06:34:34 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40810 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726198AbgLGLeb (ORCPT ); Mon, 7 Dec 2020 06:34:31 -0500 Received: from mail-pj1-x1041.google.com (mail-pj1-x1041.google.com [IPv6:2607:f8b0:4864:20::1041]) by 
lindbergh.monkeyblade.net (Postfix) with ESMTPS id D4D93C061A4F; Mon, 7 Dec 2020 03:33:45 -0800 (PST)
From: yulei.kernel@gmail.com
X-Google-Original-From: yuleixzhang@tencent.com
To: linux-mm@kvack.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com
Cc: joao.m.martins@oracle.com, rdunlap@infradead.org, sean.j.christopherson@intel.com, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang , Xiao Guangrong
Subject: [RFC V2 06/37] dmemfs: support truncating inode down
Date: Mon, 7 Dec 2020 19:30:59 +0800
Message-Id: X-Mailer: git-send-email 2.28.0 In-Reply-To: References: MIME-Version: 1.0
Precedence: bulk
List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org
From: Yulei Zhang

Supporting truncation of an inode down introduces a race between the page fault handler and the truncate handler: the entry being deleted may still be mapped into a process's VMA, since the fault path (the hot path) is kept as light as possible, so RCU is used to synchronize the two handlers.
When inode's size is updated, the handler makes sure the new size is visible to page fault handler who will not use truncated entry anymore and will not create new entry in that region Signed-off-by: Xiao Guangrong Signed-off-by: Yulei Zhang --- fs/dmemfs/inode.c | 67 ++++++++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 66 insertions(+), 1 deletion(-) diff --git a/fs/dmemfs/inode.c b/fs/dmemfs/inode.c index 7b6e51d..9ec62dc 100644 --- a/fs/dmemfs/inode.c +++ b/fs/dmemfs/inode.c @@ -98,8 +98,73 @@ static int dmemfs_mkdir(struct inode *dir, struct dentry *dentry, .rename = simple_rename, }; +static void inode_drop_dpages(struct inode *inode, loff_t start, loff_t end); + +static int dmemfs_truncate(struct inode *inode, loff_t newsize) +{ + struct super_block *sb = inode->i_sb; + loff_t current_size; + + if (newsize & ((1 << sb->s_blocksize_bits) - 1)) + return -EINVAL; + + current_size = i_size_read(inode); + i_size_write(inode, newsize); + + if (newsize >= current_size) + return 0; + + /* it cuts the inode down */ + + /* + * we should make sure inode->i_size has been updated before + * unmapping and dropping radix entries, so that other sides + * can not create new i_mapping entry beyond inode->i_size + * and the radix entry in the truncated region is not being + * used + * + * see the comments in dmemfs_fault() + */ + synchronize_rcu(); + + /* + * should unmap all mapping first as dmem pages are freed in + * inode_drop_dpages() + * + * after that, dmem page in the truncated region is not used + * by any process + */ + unmap_mapping_range(inode->i_mapping, newsize, 0, 1); + + inode_drop_dpages(inode, newsize, LLONG_MAX); + return 0; +} + +/* + * same logic as simple_setattr but we need to handle ftruncate + * carefully as we inserted self-defined entry into radix tree + */ +static int dmemfs_setattr(struct dentry *dentry, struct iattr *iattr) +{ + struct inode *inode = dentry->d_inode; + int error; + + error = setattr_prepare(dentry, iattr); + if (error) + return error; + + if (iattr->ia_valid & ATTR_SIZE) { + error = dmemfs_truncate(inode, iattr->ia_size); + if (error) + return error; + } + setattr_copy(inode, iattr); + mark_inode_dirty(inode); + return 0; +} + static const struct inode_operations dmemfs_file_inode_operations = { - .setattr = simple_setattr, + .setattr = dmemfs_setattr, .getattr = simple_getattr, }; From patchwork Mon Dec 7 11:31:00 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11955525 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-15.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9E3BCC433FE for ; Mon, 7 Dec 2020 11:38:32 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 49C752333F for ; Mon, 7 Dec 2020 11:38:32 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726954AbgLGLea (ORCPT ); Mon, 7 Dec 2020 06:34:30 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40822 
"EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726198AbgLGLea (ORCPT ); Mon, 7 Dec 2020 06:34:30 -0500 Received: from mail-pg1-x542.google.com (mail-pg1-x542.google.com [IPv6:2607:f8b0:4864:20::542]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 062DFC0613D2; Mon, 7 Dec 2020 03:33:50 -0800 (PST) Received: by mail-pg1-x542.google.com with SMTP id 69so803623pgg.8; Mon, 07 Dec 2020 03:33:50 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=QMkfGfUr8fj2hO5B5jLMQa5h9UySfQVQPVhyn3NCxd4=; b=lqKRhFOEdKV0wC0Ujvy8bkoj7sCr3aw0Z4/HDK77hQNPjJPNnVzFT7Y48IfRoK5fje 8Z3ax4VrbD+Dn0BhRTbgr8ogTL0zkSdpIuruZQ3+yy6Zr2XBsPy5eyrUcW0mVCr0/3bn qx+pBlAKJUjm/UklANHSReU2TCJRzqIA73fT6ogzAbRptOgSwK1+51QW8zzSqVkyvnwA lDTHfoho00Iof1lZ3QDDZkSwQBzQVJSl1+9+lTUfJcA7vzOvWze/xtSffZ8kB0THXJWq CodT0//aOcvnyDJGfu7Gte5EhdswKxOh89d3sLKX4/56IVsr8+P70YCf92dYnF/ATCXH EKEQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=QMkfGfUr8fj2hO5B5jLMQa5h9UySfQVQPVhyn3NCxd4=; b=PcyO/5yUR9A8cJbmz4Wbd5fm85HSOUe4hi0/lyTKK6114sYBhd1PMGc8mFqiesybvs oJUjyHmdIs/SMVVzN5+RIuK7rvJjPt/iHFJTb32HhyIMgI1RdWGcpjz8qrNcy5hk8vlA lE+xWrrvdPv82GYjoJlzEPd/hZsKMhSYBPlES/5beSt5b2698ivvtNeGt+Ha38wiiHed 7dFghRirDVti9C8buzxFOfxMhFEFkEcLKgnLO/eLRFhC4DhAkftSkYYzlTEdHgVc1lSx P5HH6AkXmQBo1RbqiR6S4OetUHFFiroqwYgpLaWojCAYdcgmfwLrZ7mJzjL+nAd+XeKI /OOg== X-Gm-Message-State: AOAM531CCL7PzRtrZ6nPk2d1VbkoZ3OdUymnkNVB9gVsZ0eXEIBoh2Fk c1hGVAtsuB1l1D8K2iXztkU= X-Google-Smtp-Source: ABdhPJwqvcI3wGyVmABSaYBJRnwrF2eTs+saCLZYg+lluLMzqvEOWJNH8GFu5PCjSHRlesroiuhjqA== X-Received: by 2002:a65:558a:: with SMTP id j10mr17960023pgs.370.1607340829629; Mon, 07 Dec 2020 03:33:49 -0800 (PST) Received: from localhost.localdomain ([203.205.141.39]) by smtp.gmail.com with ESMTPSA id d4sm14219822pfo.127.2020.12.07.03.33.46 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Mon, 07 Dec 2020 03:33:49 -0800 (PST) From: yulei.kernel@gmail.com X-Google-Original-From: yuleixzhang@tencent.com To: linux-mm@kvack.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com Cc: joao.m.martins@oracle.com, rdunlap@infradead.org, sean.j.christopherson@intel.com, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang , Xiao Guangrong Subject: [RFC V2 07/37] dmem: trace core functions Date: Mon, 7 Dec 2020 19:31:00 +0800 Message-Id: <4ee2b130c35367a6a3e7b631c872b824a1f59d23.1607332046.git.yuleixzhang@tencent.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org From: Yulei Zhang Add tracepoints for dmem alloc_init, alloc and free functions, that helps us to figure out what is happening inside dmem allocator Signed-off-by: Xiao Guangrong Signed-off-by: Yulei Zhang --- fs/dmemfs/Makefile | 1 + fs/dmemfs/inode.c | 5 ++++ fs/dmemfs/trace.h | 54 +++++++++++++++++++++++++++++++++++ include/trace/events/dmem.h | 68 +++++++++++++++++++++++++++++++++++++++++++++ mm/dmem.c | 6 ++++ 5 files changed, 134 insertions(+) create mode 100644 fs/dmemfs/trace.h create mode 100644 include/trace/events/dmem.h diff 
--git a/fs/dmemfs/Makefile b/fs/dmemfs/Makefile index 73bdc9c..0b36d03 100644 --- a/fs/dmemfs/Makefile +++ b/fs/dmemfs/Makefile @@ -2,6 +2,7 @@ # # Makefile for the linux dmem-filesystem routines. # +ccflags-y += -I $(srctree)/$(src) # needed for trace events obj-$(CONFIG_DMEM_FS) += dmemfs.o dmemfs-y += inode.o diff --git a/fs/dmemfs/inode.c b/fs/dmemfs/inode.c index 9ec62dc..7723b58 100644 --- a/fs/dmemfs/inode.c +++ b/fs/dmemfs/inode.c @@ -31,6 +31,9 @@ MODULE_AUTHOR("Tencent Corporation"); MODULE_LICENSE("GPL v2"); +#define CREATE_TRACE_POINTS +#include "trace.h" + struct dmemfs_mount_opts { unsigned long dpage_size; }; @@ -336,6 +339,7 @@ static void *find_radix_entry_or_next(struct address_space *mapping, offset += inode->i_sb->s_blocksize; dpages--; mapping->nrexceptional++; + trace_dmemfs_radix_tree_insert(index, entry); index++; } @@ -532,6 +536,7 @@ static void inode_drop_dpages(struct inode *inode, loff_t start, loff_t end) break; xa_erase(&mapping->i_pages, istart); + trace_dmemfs_radix_tree_delete(istart, pvec.pages[i]); mapping->nrexceptional--; addr = dmem_entry_to_addr(inode, pvec.pages[i]); diff --git a/fs/dmemfs/trace.h b/fs/dmemfs/trace.h new file mode 100644 index 00000000..cc11653 --- /dev/null +++ b/fs/dmemfs/trace.h @@ -0,0 +1,54 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/** + * trace.h - DesignWare Support + * + * Copyright (C) + * + * Author: Xiao Guangrong + */ + +#undef TRACE_SYSTEM +#define TRACE_SYSTEM dmemfs + +#if !defined(_TRACE_DMEMFS_H) || defined(TRACE_HEADER_MULTI_READ) +#define _TRACE_DMEMFS_H + +#include + +DECLARE_EVENT_CLASS(dmemfs_radix_tree_class, + TP_PROTO(unsigned long index, void *rentry), + TP_ARGS(index, rentry), + + TP_STRUCT__entry( + __field(unsigned long, index) + __field(void *, rentry) + ), + + TP_fast_assign( + __entry->index = index; + __entry->rentry = rentry; + ), + + TP_printk("index %lu entry %#lx", __entry->index, + (unsigned long)__entry->rentry) +); + +DEFINE_EVENT(dmemfs_radix_tree_class, dmemfs_radix_tree_insert, + TP_PROTO(unsigned long index, void *rentry), + TP_ARGS(index, rentry) +); + +DEFINE_EVENT(dmemfs_radix_tree_class, dmemfs_radix_tree_delete, + TP_PROTO(unsigned long index, void *rentry), + TP_ARGS(index, rentry) +); +#endif + +#undef TRACE_INCLUDE_PATH +#define TRACE_INCLUDE_PATH . 
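+/*
+ * With TRACE_INCLUDE_PATH set to ".", the tracepoint machinery looks for
+ * this header in the current directory, which is why the Makefile above
+ * adds -I $(srctree)/$(src) to ccflags.
+ */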
+ +#undef TRACE_INCLUDE_FILE +#define TRACE_INCLUDE_FILE trace + +/* This part must be outside protection */ +#include diff --git a/include/trace/events/dmem.h b/include/trace/events/dmem.h new file mode 100644 index 00000000..10d1b90 --- /dev/null +++ b/include/trace/events/dmem.h @@ -0,0 +1,68 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +#undef TRACE_SYSTEM +#define TRACE_SYSTEM dmem + +#if !defined(_TRACE_DMEM_H) || defined(TRACE_HEADER_MULTI_READ) +#define _TRACE_DMEM_H + +#include + +TRACE_EVENT(dmem_alloc_init, + TP_PROTO(unsigned long dpage_shift), + TP_ARGS(dpage_shift), + + TP_STRUCT__entry( + __field(unsigned long, dpage_shift) + ), + + TP_fast_assign( + __entry->dpage_shift = dpage_shift; + ), + + TP_printk("dpage_shift %lu", __entry->dpage_shift) +); + +TRACE_EVENT(dmem_alloc_pages_node, + TP_PROTO(phys_addr_t addr, int node, int try_max, int result_nr), + TP_ARGS(addr, node, try_max, result_nr), + + TP_STRUCT__entry( + __field(phys_addr_t, addr) + __field(int, node) + __field(int, try_max) + __field(int, result_nr) + ), + + TP_fast_assign( + __entry->addr = addr; + __entry->node = node; + __entry->try_max = try_max; + __entry->result_nr = result_nr; + ), + + TP_printk("addr %#lx node %d try_max %d result_nr %d", + (unsigned long)__entry->addr, __entry->node, + __entry->try_max, __entry->result_nr) +); + +TRACE_EVENT(dmem_free_pages, + TP_PROTO(phys_addr_t addr, int dpages_nr), + TP_ARGS(addr, dpages_nr), + + TP_STRUCT__entry( + __field(phys_addr_t, addr) + __field(int, dpages_nr) + ), + + TP_fast_assign( + __entry->addr = addr; + __entry->dpages_nr = dpages_nr; + ), + + TP_printk("addr %#lx dpages_nr %d", (unsigned long)__entry->addr, + __entry->dpages_nr) +); +#endif + +/* This part must be outside protection */ +#include diff --git a/mm/dmem.c b/mm/dmem.c index a77a064..aa34bf2 100644 --- a/mm/dmem.c +++ b/mm/dmem.c @@ -18,6 +18,8 @@ #include #include +#define CREATE_TRACE_POINTS +#include /* * There are two kinds of page in dmem management: * - nature page, it's the CPU's page size, i.e, 4K on x86 @@ -559,6 +561,8 @@ int dmem_alloc_init(unsigned long dpage_shift) mutex_lock(&dmem_pool.lock); + trace_dmem_alloc_init(dpage_shift); + if (dmem_pool.dpage_shift) { /* * double init on the same page size is okay @@ -686,6 +690,7 @@ int dmem_alloc_init(unsigned long dpage_shift) } } + trace_dmem_alloc_pages_node(addr, node, try_max, *result_nr); mutex_unlock(&dmem_pool.lock); } return addr; @@ -791,6 +796,7 @@ void dmem_free_pages(phys_addr_t addr, unsigned int dpages_nr) mutex_lock(&dmem_pool.lock); + trace_dmem_free_pages(addr, dpages_nr); WARN_ON(!dmem_pool.dpage_shift); dregion = find_dmem_region(addr, &pdnode); From patchwork Mon Dec 7 11:31:01 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11955345 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-15.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 197C4C2BBD4 for ; Mon, 7 Dec 2020 11:35:12 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org 
[23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 00C87233A0 for ; Mon, 7 Dec 2020 11:35:11 +0000 (UTC)
From: yulei.kernel@gmail.com
X-Google-Original-From: yuleixzhang@tencent.com
To: linux-mm@kvack.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com
Cc: joao.m.martins@oracle.com, rdunlap@infradead.org, sean.j.christopherson@intel.com, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang , Xiao Guangrong
Subject: [RFC V2 08/37] dmem: show some statistic in debugfs
Date: Mon, 7 Dec 2020 19:31:01 +0800
Message-Id: X-Mailer: git-send-email 2.28.0 In-Reply-To: References: MIME-Version: 1.0
Precedence: bulk
List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org
From: Yulei Zhang

Create a 'dmem' directory under debugfs and expose statistics for the dmem pool there: the total and free dpage counts are tracked for the whole pool and for each NUMA node.
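For illustration (this listing is not part of the patch itself), with debugfs mounted at /sys/kernel/debug the new read-only files are laid out as follows, each reporting a single decimal counter:

  /sys/kernel/debug/dmem/region_num
  /sys/kernel/debug/dmem/registered_pages
  /sys/kernel/debug/dmem/unaligned_pages
  /sys/kernel/debug/dmem/dpage_shift
  /sys/kernel/debug/dmem/total_dpages
  /sys/kernel/debug/dmem/free_dpages
  /sys/kernel/debug/dmem/nodeN/total_dpages
  /sys/kernel/debug/dmem/nodeN/free_dpages

where one nodeN directory is created for every node that actually has dmem regions.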
Signed-off-by: Xiao Guangrong Signed-off-by: Yulei Zhang --- mm/Kconfig | 8 +++++ mm/dmem.c | 100 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++- 2 files changed, 107 insertions(+), 1 deletion(-) diff --git a/mm/Kconfig b/mm/Kconfig index 3a6d408..4dd8896 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -234,6 +234,14 @@ config DMEM Allow reservation of memory which could be for the dedicated use of dmem. It's the basis of dmemfs. +config DMEM_DEBUG_FS + bool "Enable debug information for direct memory" + depends on DMEM && DEBUG_FS + help + This option enables showing various statistics of direct memory + in debugfs filesystem. + +# # support for memory compaction config COMPACTION bool "Allow for memory compaction" diff --git a/mm/dmem.c b/mm/dmem.c index aa34bf2..6992e57 100644 --- a/mm/dmem.c +++ b/mm/dmem.c @@ -164,6 +164,103 @@ int dmem_region_register(int node, phys_addr_t start, phys_addr_t end) return 0; } +#ifdef CONFIG_DMEM_DEBUG_FS +struct debugfs_entry { + const char *name; + unsigned long offset; +}; + +#define DMEM_POOL_OFFSET(x) offsetof(struct dmem_pool, x) +#define DMEM_POOL_ENTRY(x) {__stringify(x), DMEM_POOL_OFFSET(x)} + +#define DMEM_NODE_OFFSET(x) offsetof(struct dmem_node, x) +#define DMEM_NODE_ENTRY(x) {__stringify(x), DMEM_NODE_OFFSET(x)} + +static struct debugfs_entry dmem_pool_entries[] = { + DMEM_POOL_ENTRY(region_num), + DMEM_POOL_ENTRY(registered_pages), + DMEM_POOL_ENTRY(unaligned_pages), + DMEM_POOL_ENTRY(dpage_shift), + DMEM_POOL_ENTRY(total_dpages), + DMEM_POOL_ENTRY(free_dpages), +}; + +static struct debugfs_entry dmem_node_entries[] = { + DMEM_NODE_ENTRY(total_dpages), + DMEM_NODE_ENTRY(free_dpages), +}; + +static int dmem_entry_get(void *offset, u64 *val) +{ + *val = *(u64 *)offset; + return 0; +} + +DEFINE_SIMPLE_ATTRIBUTE(dmem_fops, dmem_entry_get, NULL, "%llu\n"); + +static int dmemfs_init_debugfs_node(struct dmem_node *dnode, + struct dentry *parent) +{ + struct dentry *node_dir; + char dir_name[32]; + int i, ret = -EEXIST; + + snprintf(dir_name, sizeof(dir_name), "node%ld", + dnode - dmem_pool.nodes); + node_dir = debugfs_create_dir(dir_name, parent); + if (!node_dir) + return ret; + + for (i = 0; i < ARRAY_SIZE(dmem_node_entries); i++) + if (!debugfs_create_file(dmem_node_entries[i].name, 0444, + node_dir, (void *)dnode + dmem_node_entries[i].offset, + &dmem_fops)) + return ret; + return 0; +} + +static int dmemfs_init_debugfs(void) +{ + struct dentry *dmem_debugfs_dir; + struct dmem_node *dnode; + int i, ret = -EEXIST; + + dmem_debugfs_dir = debugfs_create_dir("dmem", NULL); + if (!dmem_debugfs_dir) + return ret; + + for (i = 0; i < ARRAY_SIZE(dmem_pool_entries); i++) + if (!debugfs_create_file(dmem_pool_entries[i].name, 0444, + dmem_debugfs_dir, + (void *)&dmem_pool + dmem_pool_entries[i].offset, + &dmem_fops)) + goto exit; + + for_each_dmem_node(dnode) { + /* + * do not create debugfs files for the node + * where no memory is available + */ + if (list_empty(&dnode->regions)) + continue; + + if (dmemfs_init_debugfs_node(dnode, dmem_debugfs_dir)) + goto exit; + } + + return 0; +exit: + debugfs_remove_recursive(dmem_debugfs_dir); + return ret; +} + +#else +static int dmemfs_init_debugfs(void) +{ + return 0; +} +#endif + #define PENALTY_FOR_DMEM_SHARED_NODE (1) static int dmem_nodeload[MAX_NUMNODES] __initdata; @@ -364,7 +461,8 @@ static int __init dmem_late_init(void) goto exit; } } - return ret; + + return dmemfs_init_debugfs(); exit: dmem_uinit(); return ret; From patchwork Mon Dec 7 11:31:02 2020 Content-Type: text/plain; 
charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11955351 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-15.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 295ECC3527E for ; Mon, 7 Dec 2020 11:35:14 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id DC2E923340 for ; Mon, 7 Dec 2020 11:35:13 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727070AbgLGLeu (ORCPT ); Mon, 7 Dec 2020 06:34:50 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40876 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727059AbgLGLet (ORCPT ); Mon, 7 Dec 2020 06:34:49 -0500 Received: from mail-pf1-x443.google.com (mail-pf1-x443.google.com [IPv6:2607:f8b0:4864:20::443]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id B0262C061A54; Mon, 7 Dec 2020 03:33:59 -0800 (PST) Received: by mail-pf1-x443.google.com with SMTP id w6so9616581pfu.1; Mon, 07 Dec 2020 03:33:59 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=hZ7s/TPFWElvYaRd2Z4ImP4zL6ID97NqxxjlgrDo46E=; b=AJAm2JQDxedsaO81FFlH0EqySZx0V1JZAWUyZSgG2JMejw/2P/2YNsn13l0harXSnv U3Gyun4C/tDEp5GY75ukYQ8NKpN7HVEwvvgyY8BzE7ExfFNdBMDEZ/aOgvx075MECqL7 pYdV9tIyA5ilsGCR6SY4i+lSEV5W2rx9BftPjIVtKxyjCDnbwztLge9MJFNY8KOYMREE wy7RQIV+YMYcC9VSimCcE1RnrIyOSmzDJjr9L/rCvjGXRPmPiQpqhp+CP4j7obar6c5B qxRX1Y83xXr1EcYfim8OijyqJDwojAyLtA46wHnMUKFxo4aQcxCnXcAgkjCcPB8f/9ef owvg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=hZ7s/TPFWElvYaRd2Z4ImP4zL6ID97NqxxjlgrDo46E=; b=DwwkgLz3GSwfIi30Hjq7iJr4GY0lkbU6ataj6ZkHMh6lrHIBA2R4LtsbLyGdflLlFx xyDaY9vb0HiHInIatvkhOyv2D5WKjIrkbLnauUEL7qGdPVRkpF3k6R2KDm2fbsg2mmRe wBxcqM8clvJV929R8R5UidRwsrOXKL5PxZ2gwVIoQuH7um9bwod54WJpqKvK0r8iJPNA wqbqSNlAZ9mDQpDDRpb8mOtdTFsvQZlwcZl8Qp/ELHfU3oIAWoA2MweBfGFbv40JDq+S D/cg+PIMWg67DZ7md/Q0UtQvWXga+3Ov9OIgJ3cyrjWTR74Zc82hNlCriP4HH9FAlhAW LK/w== X-Gm-Message-State: AOAM5308/8YrfOy4HRgozIB/UHvk+9JkIvvd6FkZ5E830NlhZT4uXwwi 0jyK5LlxbqI2wJw5azGG8Ec= X-Google-Smtp-Source: ABdhPJwNOCi3KR2PAp9drrOXOVZPvcKF+MCuZ8w1Uu6pFaCaU7qErwtZSzlK+Cv5rjtTzxX3eZsSAw== X-Received: by 2002:a17:902:aa84:b029:da:f114:6022 with SMTP id d4-20020a170902aa84b02900daf1146022mr5850388plr.46.1607340839351; Mon, 07 Dec 2020 03:33:59 -0800 (PST) Received: from localhost.localdomain ([203.205.141.39]) by smtp.gmail.com with ESMTPSA id d4sm14219822pfo.127.2020.12.07.03.33.56 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Mon, 07 Dec 2020 03:33:58 -0800 (PST) From: yulei.kernel@gmail.com X-Google-Original-From: yuleixzhang@tencent.com To: linux-mm@kvack.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, 
linux-kernel@vger.kernel.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com Cc: joao.m.martins@oracle.com, rdunlap@infradead.org, sean.j.christopherson@intel.com, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang , Xiao Guangrong Subject: [RFC V2 09/37] dmemfs: support remote access Date: Mon, 7 Dec 2020 19:31:02 +0800 Message-Id: X-Mailer: git-send-email 2.28.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org From: Yulei Zhang It is required by ptrace_writedata and ptrace_readdata to access dmem memory remotely. The typical user is gdb, after this patch, gdb is able to read & write memory owned by the attached process Signed-off-by: Xiao Guangrong Signed-off-by: Yulei Zhang --- fs/dmemfs/inode.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 46 insertions(+) diff --git a/fs/dmemfs/inode.c b/fs/dmemfs/inode.c index 7723b58..3192f31 100644 --- a/fs/dmemfs/inode.c +++ b/fs/dmemfs/inode.c @@ -364,6 +364,51 @@ static void radix_put_entry(void) rcu_read_unlock(); } +static bool check_vma_access(struct vm_area_struct *vma, int write) +{ + vm_flags_t vm_flags = write ? VM_WRITE : VM_READ; + + return !!(vm_flags & vma->vm_flags); +} + +static int +dmemfs_access_dmem(struct vm_area_struct *vma, unsigned long addr, + void *buf, int len, int write) +{ + struct inode *inode = file_inode(vma->vm_file); + struct super_block *sb = inode->i_sb; + void *entry, *maddr; + int offset, pgoff; + + if (!check_vma_access(vma, write)) + return -EACCES; + + pgoff = linear_page_index(vma, addr); + if (pgoff > (MAX_LFS_FILESIZE >> PAGE_SHIFT)) + return -EFAULT; + + entry = radix_get_create_entry(vma, addr, inode, pgoff); + if (IS_ERR(entry)) + return PTR_ERR(entry); + + offset = addr & (sb->s_blocksize - 1); + addr = dmem_entry_to_addr(inode, entry); + + /* + * it is not beyond vma's region as the vma should be aligned + * to blocksize + */ + len = min(len, (int)(sb->s_blocksize - offset)); + maddr = __va(addr); + if (write) + memcpy(maddr + offset, buf, len); + else + memcpy(buf, maddr + offset, len); + radix_put_entry(); + + return len; +} + static vm_fault_t dmemfs_fault(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; @@ -400,6 +445,7 @@ static unsigned long dmemfs_pagesize(struct vm_area_struct *vma) static const struct vm_operations_struct dmemfs_vm_ops = { .fault = dmemfs_fault, .pagesize = dmemfs_pagesize, + .access = dmemfs_access_dmem, }; int dmemfs_file_mmap(struct file *file, struct vm_area_struct *vma) From patchwork Mon Dec 7 11:31:03 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11955349 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-15.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id CDDF6C35273 for ; Mon, 7 Dec 2020 11:35:13 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 97942233F6 for ; Mon, 7 Dec 
2020 11:35:13 +0000 (UTC)
From: yulei.kernel@gmail.com
X-Google-Original-From: yuleixzhang@tencent.com
To: linux-mm@kvack.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com
Cc: joao.m.martins@oracle.com, rdunlap@infradead.org, sean.j.christopherson@intel.com, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang , Xiao Guangrong
Subject: [RFC V2 10/37] dmemfs: introduce max_alloc_try_dpages parameter
Date: Mon, 7 Dec 2020 19:31:03 +0800
Message-Id: <08ff7e40806a2342720835b95f9be24d5778c703.1607332046.git.yuleixzhang@tencent.com>
X-Mailer: git-send-email 2.28.0
In-Reply-To: References: MIME-Version: 1.0
Precedence: bulk
List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org
From: Yulei Zhang

max_alloc_try_dpages specifies how many dmem pages are allocated at one time, so that multiple radix entries can be created in a single fault. That relieves allocation pressure and makes page faults faster.
However that could cause no dmem page mmapped to userspace even if there are some free dmem pages. Set it to 1 to completely disable this behavior. Signed-off-by: Xiao Guangrong Signed-off-by: Yulei Zhang --- fs/dmemfs/inode.c | 41 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 41 insertions(+) diff --git a/fs/dmemfs/inode.c b/fs/dmemfs/inode.c index 3192f31..443f2e1 100644 --- a/fs/dmemfs/inode.c +++ b/fs/dmemfs/inode.c @@ -34,6 +34,8 @@ #define CREATE_TRACE_POINTS #include "trace.h" +static uint __read_mostly max_alloc_try_dpages = 1; + struct dmemfs_mount_opts { unsigned long dpage_size; }; @@ -46,6 +48,44 @@ enum dmemfs_param { Opt_dpagesize, }; +static int +max_alloc_try_dpages_set(const char *val, const struct kernel_param *kp) +{ + uint sval; + int ret; + + ret = kstrtouint(val, 0, &sval); + if (ret) + return ret; + + /* should be 1 at least */ + if (!sval) + return -EINVAL; + + max_alloc_try_dpages = sval; + return 0; +} + +static struct kernel_param_ops alloc_max_try_dpages_ops = { + .set = max_alloc_try_dpages_set, + .get = param_get_uint, +}; + +/* + * it specifies the dmem page number allocated at one time, then + * multiple radix entries can be created. That will relief the + * allocation pressure and make page fault more fast. + * + * however that could cause no dmem page mmapped to userspace + * even if there are some free dmem pages + * + * set it to 1 to completely disable this behavior + */ +fs_param_cb(max_alloc_try_dpages, &alloc_max_try_dpages_ops, + &max_alloc_try_dpages, 0644); +__MODULE_PARM_TYPE(max_alloc_try_dpages, "uint"); +MODULE_PARM_DESC(max_alloc_try_dpages, "Set the dmem page number allocated at one time, should be 1 at least"); + const struct fs_parameter_spec dmemfs_fs_parameters[] = { fsparam_string("pagesize", Opt_dpagesize), {} @@ -314,6 +354,7 @@ static void *find_radix_entry_or_next(struct address_space *mapping, } rcu_read_unlock(); + try_dpages = min(try_dpages, max_alloc_try_dpages); /* entry does not exist, create it */ addr = dmem_alloc_pages_vma(vma, fault_addr, try_dpages, &dpages); if (!addr) { From patchwork Mon Dec 7 11:31:04 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11955347 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-15.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 7725AC3526D for ; Mon, 7 Dec 2020 11:35:13 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 50564233FA for ; Mon, 7 Dec 2020 11:35:13 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727135AbgLGLe7 (ORCPT ); Mon, 7 Dec 2020 06:34:59 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40898 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727094AbgLGLe4 (ORCPT ); Mon, 7 Dec 2020 06:34:56 -0500 Received: from mail-pf1-x443.google.com (mail-pf1-x443.google.com [IPv6:2607:f8b0:4864:20::443]) by lindbergh.monkeyblade.net (Postfix) 
with ESMTPS id 85852C0613D0; Mon, 7 Dec 2020 03:34:16 -0800 (PST)
From: yulei.kernel@gmail.com
X-Google-Original-From: yuleixzhang@tencent.com
To: linux-mm@kvack.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com
Cc: joao.m.martins@oracle.com, rdunlap@infradead.org, sean.j.christopherson@intel.com, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang
Subject: [RFC V2 11/37] mm: export mempolicy interfaces to serve dmem allocator
Date: Mon, 7 Dec 2020 19:31:04 +0800
Message-Id: X-Mailer: git-send-email 2.28.0
In-Reply-To: References: MIME-Version: 1.0
Precedence: bulk
List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org
From: Yulei Zhang

Export the get_vma_policy() and interleave_nid() interfaces to serve the dmem allocator.
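As a rough illustration of the intended use (a sketch only, not code from this series; dmem_pick_node() is a made-up helper name), a pfn-based allocator could pick a target node from the VMA's policy like this:

	/* illustrative sketch, assumes <linux/mempolicy.h> */
	static int dmem_pick_node(struct vm_area_struct *vma,
				  unsigned long addr, int dpage_shift)
	{
		struct mempolicy *pol = get_vma_policy(vma, addr);
		int nid = numa_node_id();

		if (pol->mode == MPOL_INTERLEAVE)
			nid = interleave_nid(pol, vma, addr, dpage_shift);

		mpol_cond_put(pol);
		return nid;
	}

get_vma_policy() takes an extra reference for shared policies, hence the mpol_cond_put() once the node has been chosen.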
Signed-off-by: Yulei Zhang --- include/linux/mempolicy.h | 3 +++ mm/mempolicy.c | 4 ++-- 2 files changed, 5 insertions(+), 2 deletions(-) diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h index 5f1c74d..4789661 100644 --- a/include/linux/mempolicy.h +++ b/include/linux/mempolicy.h @@ -139,6 +139,9 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp, struct mempolicy *get_task_policy(struct task_struct *p); struct mempolicy *__get_vma_policy(struct vm_area_struct *vma, unsigned long addr); +struct mempolicy *get_vma_policy(struct vm_area_struct *vma, unsigned long addr); +unsigned interleave_nid(struct mempolicy *pol, struct vm_area_struct *vma, + unsigned long addr, int shift); bool vma_policy_mof(struct vm_area_struct *vma); extern void numa_default_policy(void); diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 3ca4898..efd80e5 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1813,7 +1813,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma, * freeing by another task. It is the caller's responsibility to free the * extra reference for shared policies. */ -static struct mempolicy *get_vma_policy(struct vm_area_struct *vma, +struct mempolicy *get_vma_policy(struct vm_area_struct *vma, unsigned long addr) { struct mempolicy *pol = __get_vma_policy(vma, addr); @@ -1978,7 +1978,7 @@ static unsigned offset_il_node(struct mempolicy *pol, unsigned long n) } /* Determine a node number for interleave */ -static inline unsigned interleave_nid(struct mempolicy *pol, +unsigned interleave_nid(struct mempolicy *pol, struct vm_area_struct *vma, unsigned long addr, int shift) { if (vma) { From patchwork Mon Dec 7 11:31:05 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11955341 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-15.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 517DAC2BB3F for ; Mon, 7 Dec 2020 11:35:11 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 13C7B233A0 for ; Mon, 7 Dec 2020 11:35:11 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727002AbgLGLeg (ORCPT ); Mon, 7 Dec 2020 06:34:36 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40784 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726992AbgLGLef (ORCPT ); Mon, 7 Dec 2020 06:34:35 -0500 Received: from mail-pg1-x52e.google.com (mail-pg1-x52e.google.com [IPv6:2607:f8b0:4864:20::52e]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 02578C0613D1; Mon, 7 Dec 2020 03:34:20 -0800 (PST) Received: by mail-pg1-x52e.google.com with SMTP id o5so8645987pgm.10; Mon, 07 Dec 2020 03:34:19 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=ydAF98InOXvLOUHWcQbefHcTxGQjuTY7EYp8f1Ij7rU=; 
From: Yulei Zhang <yuleixzhang@tencent.com>
Subject: [RFC V2 12/37] dmem: introduce mempolicy support
Date: Mon, 7 Dec 2020 19:31:05 +0800
Message-Id: <28718e3b8886b9ec3e4700c2d55a9629ca9fc27c.1607332046.git.yuleixzhang@tencent.com>

Add mempolicy support to dmem so that dmem pages are allocated from the nodes specified by the VMA's memory policy.
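As a usage illustration (not part of the patch), once a dmemfs file is mmap()ed a process could set the mapping's policy with mbind(), and the dmem fault path would then consult that VMA policy when allocating dpages. The mount point and file name below are made up for the example, and whether mbind() succeeds on such a mapping depends on the rest of this series.

/*
 * Hedged usage sketch: apply an interleave policy to a dmemfs mapping.
 * /mnt/dmem/guest-mem is a hypothetical file on a mounted dmemfs.
 * mbind() is declared in <numaif.h>; link with -lnuma on most setups.
 */
#include <fcntl.h>
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	const size_t len = 4UL << 20;	/* two 2 MiB dpages, for example */
	unsigned long nodemask = 0x3;	/* interleave across nodes 0 and 1 */
	int fd = open("/mnt/dmem/guest-mem", O_RDWR);

	if (fd < 0) {
		perror("open");
		return EXIT_FAILURE;
	}

	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return EXIT_FAILURE;
	}

	/*
	 * With this patch the fault path looks up the VMA policy, so the
	 * dpages backing this range should be spread across nodes 0 and 1.
	 */
	if (mbind(p, len, MPOL_INTERLEAVE, &nodemask, 8 * sizeof(nodemask), 0))
		perror("mbind");

	munmap(p, len);
	close(fd);
	return 0;
}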
Signed-off-by: Haiwei Li Signed-off-by: Yulei Zhang --- arch/x86/Kconfig | 1 + arch/x86/include/asm/pgtable.h | 7 ++++ arch/x86/include/asm/pgtable_types.h | 13 +++++++- fs/dmemfs/Kconfig | 3 ++ include/linux/pgtable.h | 7 ++++ mm/Kconfig | 3 ++ mm/dmem.c | 63 ++++++++++++++++++++++++++++++++++-- 7 files changed, 94 insertions(+), 3 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index f6946b8..9ccee76 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -73,6 +73,7 @@ config X86 select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE select ARCH_HAS_PMEM_API if X86_64 select ARCH_HAS_PTE_DEVMAP if X86_64 + select ARCH_HAS_PTE_DMEM if X86_64 select ARCH_HAS_PTE_SPECIAL select ARCH_HAS_UACCESS_FLUSHCACHE if X86_64 select ARCH_HAS_COPY_MC if X86_64 diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index a02c672..dd4aff6 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -452,6 +452,13 @@ static inline pmd_t pmd_mkdevmap(pmd_t pmd) return pmd_set_flags(pmd, _PAGE_DEVMAP); } +#ifdef CONFIG_ARCH_HAS_PTE_DMEM +static inline pmd_t pmd_mkdmem(pmd_t pmd) +{ + return pmd_set_flags(pmd, _PAGE_SPECIAL | _PAGE_DMEM); +} +#endif + static inline pmd_t pmd_mkhuge(pmd_t pmd) { return pmd_set_flags(pmd, _PAGE_PSE); diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h index 816b31c..ee4cae1 100644 --- a/arch/x86/include/asm/pgtable_types.h +++ b/arch/x86/include/asm/pgtable_types.h @@ -23,6 +23,15 @@ #define _PAGE_BIT_SOFTW2 10 /* " */ #define _PAGE_BIT_SOFTW3 11 /* " */ #define _PAGE_BIT_PAT_LARGE 12 /* On 2MB or 1GB pages */ +#define _PAGE_BIT_DMEM 57 /* Flag used to indicate dmem pmd. + * Since _PAGE_BIT_SPECIAL is defined + * same as _PAGE_BIT_CPA_TEST, we can + * not only use _PAGE_BIT_SPECIAL, so + * add _PAGE_BIT_DMEM to help + * indicate it. Since dmem pte will + * never be splitting, setting + * _PAGE_BIT_SPECIAL for pte is enough. + */ #define _PAGE_BIT_SOFTW4 58 /* available for programmer */ #define _PAGE_BIT_PKEY_BIT0 59 /* Protection Keys, bit 1/4 */ #define _PAGE_BIT_PKEY_BIT1 60 /* Protection Keys, bit 2/4 */ @@ -112,9 +121,11 @@ #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE) #define _PAGE_NX (_AT(pteval_t, 1) << _PAGE_BIT_NX) #define _PAGE_DEVMAP (_AT(u64, 1) << _PAGE_BIT_DEVMAP) +#define _PAGE_DMEM (_AT(u64, 1) << _PAGE_BIT_DMEM) #else #define _PAGE_NX (_AT(pteval_t, 0)) #define _PAGE_DEVMAP (_AT(pteval_t, 0)) +#define _PAGE_DMEM (_AT(pteval_t, 0)) #endif #define _PAGE_PROTNONE (_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE) @@ -128,7 +139,7 @@ #define _PAGE_CHG_MASK (PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT | \ _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY | \ _PAGE_SOFT_DIRTY | _PAGE_DEVMAP | _PAGE_ENC | \ - _PAGE_UFFD_WP) + _PAGE_UFFD_WP | _PAGE_DMEM) #define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE) /* diff --git a/fs/dmemfs/Kconfig b/fs/dmemfs/Kconfig index d2894a5..19ca391 100644 --- a/fs/dmemfs/Kconfig +++ b/fs/dmemfs/Kconfig @@ -1,5 +1,8 @@ config DMEM_FS tristate "Direct Memory filesystem support" + depends on DMEM + depends on TRANSPARENT_HUGEPAGE + depends on ARCH_HAS_PTE_DMEM help dmemfs (Direct Memory filesystem) is device memory or reserved memory based filesystem. 
This kind of memory is special as it diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 71125a4..9e65694 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -1157,6 +1157,13 @@ static inline int pud_trans_unstable(pud_t *pud) #endif } +#ifndef CONFIG_ARCH_HAS_PTE_DMEM +static inline pmd_t pmd_mkdmem(pmd_t pmd) +{ + return pmd; +} +#endif + #ifndef pmd_read_atomic static inline pmd_t pmd_read_atomic(pmd_t *pmdp) { diff --git a/mm/Kconfig b/mm/Kconfig index 4dd8896..10fd7ff 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -794,6 +794,9 @@ config IDLE_PAGE_TRACKING config ARCH_HAS_PTE_DEVMAP bool +config ARCH_HAS_PTE_DMEM + bool + config ZONE_DEVICE bool "Device memory (pmem, HMM, etc...) hotplug support" depends on MEMORY_HOTPLUG diff --git a/mm/dmem.c b/mm/dmem.c index 6992e57..2e61dbd 100644 --- a/mm/dmem.c +++ b/mm/dmem.c @@ -822,6 +822,56 @@ int dmem_alloc_init(unsigned long dpage_shift) } EXPORT_SYMBOL(dmem_alloc_pages_nodemask); +/* Return a nodelist indicated for current node representing a mempolicy */ +static int *policy_nodelist(struct mempolicy *policy) +{ + int nd = numa_node_id(); + + switch (policy->mode) { + case MPOL_PREFERRED: + if (!(policy->flags & MPOL_F_LOCAL)) + nd = policy->v.preferred_node; + break; + case MPOL_BIND: + if (unlikely(!node_isset(nd, policy->v.nodes))) + nd = first_node(policy->v.nodes); + break; + default: + WARN_ON(1); + } + return dmem_nodelist(nd); +} + +static nodemask_t *dmem_policy_nodemask(struct mempolicy *policy) +{ + if (unlikely(policy->mode == MPOL_BIND) && + cpuset_nodemask_valid_mems_allowed(&policy->v.nodes)) + return &policy->v.nodes; + + return NULL; +} + +static void +get_mempolicy_nlist_and_nmask(struct mempolicy *pol, + struct vm_area_struct *vma, unsigned long addr, + int **nl, nodemask_t **nmask) +{ + if (pol->mode == MPOL_INTERLEAVE) { + unsigned int nid; + + /* + * we use dpage_shift to interleave numa nodes although + * multiple dpages may be allocated + */ + nid = interleave_nid(pol, vma, addr, dmem_pool.dpage_shift); + *nl = dmem_nodelist(nid); + *nmask = NULL; + } else { + *nl = policy_nodelist(pol); + *nmask = dmem_policy_nodemask(pol); + } +} + /* * dmem_alloc_pages_vma - Allocate pages for a VMA. * @@ -830,6 +880,9 @@ int dmem_alloc_init(unsigned long dpage_shift) * @try_max: try to allocate @try_max dpages if possible * @result_nr: allocated dpage number returned to the caller * + * This function allocates pages from dmem pool and applies a NUMA policy + * associated with the VMA. + * * Return the physical address of the first dpage allocated from dmem * pool, or 0 on failure. 
The allocated dpage number is filled into * @result_nr @@ -839,13 +892,19 @@ int dmem_alloc_init(unsigned long dpage_shift) unsigned int try_max, unsigned int *result_nr) { phys_addr_t phys_addr; + struct mempolicy *pol; int *nl; + nodemask_t *nmask; unsigned int cpuset_mems_cookie; retry_cpuset: - nl = dmem_nodelist(numa_node_id()); + pol = get_vma_policy(vma, addr); + cpuset_mems_cookie = read_mems_allowed_begin(); + + get_mempolicy_nlist_and_nmask(pol, vma, addr, &nl, &nmask); + mpol_cond_put(pol); - phys_addr = dmem_alloc_pages_from_nodelist(nl, NULL, try_max, + phys_addr = dmem_alloc_pages_from_nodelist(nl, nmask, try_max, result_nr); if (unlikely(!phys_addr && read_mems_allowed_retry(cpuset_mems_cookie))) goto retry_cpuset; From patchwork Mon Dec 7 11:31:06 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11955343 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-15.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id B5CAAC2BBCA for ; Mon, 7 Dec 2020 11:35:11 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 8B26523340 for ; Mon, 7 Dec 2020 11:35:11 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727016AbgLGLej (ORCPT ); Mon, 7 Dec 2020 06:34:39 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40798 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726992AbgLGLei (ORCPT ); Mon, 7 Dec 2020 06:34:38 -0500 Received: from mail-pf1-x435.google.com (mail-pf1-x435.google.com [IPv6:2607:f8b0:4864:20::435]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 5A422C0613D3; Mon, 7 Dec 2020 03:34:23 -0800 (PST) Received: by mail-pf1-x435.google.com with SMTP id w6so9617930pfu.1; Mon, 07 Dec 2020 03:34:23 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=JTgCCtK3neklBBZP7ALPG1Lf4m1ypdSzhK8xGX8aX9M=; b=SMiGuwjR9VTZexC3BrV+8B6agmZASS9JEaHnUZ2E1SWgJbsxMfQOcftyniDMaI4UtS HqH6yx1ijBUM+P9SqB+YEbgYrBhqCjmLdV5Rg7vXsyZ5jaSaXRaZJmc3Yjxq+eEZG4oV nX66/M0q5EhEDZF9/LBmyncihaCPD6I2LGmGewQmo7Ga9FfsB0ouuSEkUUjD6lxI+caB D6bY535ziVmj3fxBixW3OY+kJ46tqvHPAMERQCF66rPJh2efyUOOl5tjzf+zzNRblBCI Xb2Fj9L2Eaf0AdeetZGtjqVgVV4QkTfvH80QfKPg+23LmZ8AusMLBfv6lkdZ/XnKYpIr aHeg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=JTgCCtK3neklBBZP7ALPG1Lf4m1ypdSzhK8xGX8aX9M=; b=lQMXEqI8Az+Vd4vTC4KROMF4iGJIBVQRJTu7cT9MT0YYypNifIfwU4LP609D+R/sOy fGZrZgXi8tPepprVItAhxnLOeeopak52JUkIOls8Ue29hH2vHbmbjhn/8RF3fRHTIi2X 7hYKZ967Q1v0aRkVd8/Gp/n0DUxQ9oeD0kmi0uhHVplAmaOfMm14D29lhoqi2OreA14g 1hyfGSUCERWGjxRFM1lLWLUyUJNdBdNOdQJZNv4O5Fv9U5L2Rbsr+MskxfFR/kVwHaHL 
L/YcqwJoYCllOb+GYVtZinQNUo/OvAerkW2p71Pl88Vsmbtn+NyI5PmVSAw9Pii6JIV6 VC2A== X-Gm-Message-State: AOAM530j2SaSL8r9Kb9DfFwvzw7MDJ5NcncJYvZHcChb5yWvAskcMEYZ /ggmBAv56tP5FPFvhFIdQyI= X-Google-Smtp-Source: ABdhPJzkG6zV67FqF8oje1zqM80ZaXke6YmOfj+FEi4/qDBrNc9iaM3WstHda8DEvRc0vm268vDR+g== X-Received: by 2002:a63:445c:: with SMTP id t28mr17760750pgk.373.1607340862967; Mon, 07 Dec 2020 03:34:22 -0800 (PST) Received: from localhost.localdomain ([203.205.141.39]) by smtp.gmail.com with ESMTPSA id d4sm14219822pfo.127.2020.12.07.03.34.19 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Mon, 07 Dec 2020 03:34:22 -0800 (PST) From: yulei.kernel@gmail.com X-Google-Original-From: yuleixzhang@tencent.com To: linux-mm@kvack.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com Cc: joao.m.martins@oracle.com, rdunlap@infradead.org, sean.j.christopherson@intel.com, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang , Chen Zhuo Subject: [RFC V2 13/37] mm, dmem: introduce PFN_DMEM and pfn_t_dmem Date: Mon, 7 Dec 2020 19:31:06 +0800 Message-Id: X-Mailer: git-send-email 2.28.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org From: Yulei Zhang Introduce PFN_DMEM as a new pfn flag for dmem pfn, define it by setting (BITS_PER_LONG_LONG - 6) bit. Introduce pfn_t_dmem() helper to recognize dmem pfn. Signed-off-by: Chen Zhuo Signed-off-by: Yulei Zhang --- include/linux/pfn_t.h | 17 ++++++++++++++++- 1 file changed, 16 insertions(+), 1 deletion(-) diff --git a/include/linux/pfn_t.h b/include/linux/pfn_t.h index 2d91482..c6c0f1f 100644 --- a/include/linux/pfn_t.h +++ b/include/linux/pfn_t.h @@ -11,6 +11,7 @@ * PFN_MAP - pfn has a dynamic page mapping established by a device driver * PFN_SPECIAL - for CONFIG_FS_DAX_LIMITED builds to allow XIP, but not * get_user_pages + * PFN_DMEM - pfn references a dmem page */ #define PFN_FLAGS_MASK (((u64) (~PAGE_MASK)) << (BITS_PER_LONG_LONG - PAGE_SHIFT)) #define PFN_SG_CHAIN (1ULL << (BITS_PER_LONG_LONG - 1)) @@ -18,13 +19,15 @@ #define PFN_DEV (1ULL << (BITS_PER_LONG_LONG - 3)) #define PFN_MAP (1ULL << (BITS_PER_LONG_LONG - 4)) #define PFN_SPECIAL (1ULL << (BITS_PER_LONG_LONG - 5)) +#define PFN_DMEM (1ULL << (BITS_PER_LONG_LONG - 6)) #define PFN_FLAGS_TRACE \ { PFN_SPECIAL, "SPECIAL" }, \ { PFN_SG_CHAIN, "SG_CHAIN" }, \ { PFN_SG_LAST, "SG_LAST" }, \ { PFN_DEV, "DEV" }, \ - { PFN_MAP, "MAP" } + { PFN_MAP, "MAP" }, \ + { PFN_DMEM, "DMEM" } static inline pfn_t __pfn_to_pfn_t(unsigned long pfn, u64 flags) { @@ -128,4 +131,16 @@ static inline bool pfn_t_special(pfn_t pfn) return false; } #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */ + +#ifdef CONFIG_ARCH_HAS_PTE_DMEM +static inline bool pfn_t_dmem(pfn_t pfn) +{ + return (pfn.val & PFN_DMEM) == PFN_DMEM; +} +#else +static inline bool pfn_t_dmem(pfn_t pfn) +{ + return false; +} +#endif /* CONFIG_ARCH_HAS_PTE_DMEM */ #endif /* _LINUX_PFN_T_H_ */ From patchwork Mon Dec 7 11:31:07 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11955519 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-15.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, 
From: Yulei Zhang <yuleixzhang@tencent.com>
Subject: [RFC V2 14/37] mm, dmem: differentiate dmem-pmd and thp-pmd
Date: Mon, 7 Dec 2020 19:31:07 +0800
Message-Id:
<9e1413b30d1cd4777af732e0995a7e7a03baeea6.1607332046.git.yuleixzhang@tencent.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org From: Yulei Zhang A dmem huge page is ultimately not a transparent huge page. As we decided to use pmd_special() to distinguish dmem-pmd from thp-pmd, we should make some slightly different semantics between pmd_special() and pmd_trans_huge(), just as pmd_devmap() in upstream. This distinction is especially important in some mm-core paths such as zap_pmd_range(). Explicitly mark the pmd_trans_huge() helpers that dmem needs by adding pmd_special() checks. This method could be reused in many mm-core paths. Signed-off-by: Chen Zhuo Signed-off-by: Yulei Zhang --- arch/x86/include/asm/pgtable.h | 10 +++++++++- include/linux/pgtable.h | 5 +++++ 2 files changed, 14 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index dd4aff6..6ce85d4 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -259,7 +259,7 @@ static inline int pmd_large(pmd_t pte) /* NOTE: when predicate huge page, consider also pmd_devmap, or use pmd_large */ static inline int pmd_trans_huge(pmd_t pmd) { - return (pmd_val(pmd) & (_PAGE_PSE|_PAGE_DEVMAP)) == _PAGE_PSE; + return (pmd_val(pmd) & (_PAGE_PSE|_PAGE_DEVMAP|_PAGE_DMEM)) == _PAGE_PSE; } #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD @@ -275,6 +275,14 @@ static inline int has_transparent_hugepage(void) return boot_cpu_has(X86_FEATURE_PSE); } +#ifdef CONFIG_ARCH_HAS_PTE_DMEM +static inline int pmd_special(pmd_t pmd) +{ + return (pmd_val(pmd) & (_PAGE_SPECIAL | _PAGE_DMEM)) == + (_PAGE_SPECIAL | _PAGE_DMEM); +} +#endif + #ifdef CONFIG_ARCH_HAS_PTE_DEVMAP static inline int pmd_devmap(pmd_t pmd) { diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 9e65694..30342b8 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -1162,6 +1162,11 @@ static inline pmd_t pmd_mkdmem(pmd_t pmd) { return pmd; } + +static inline int pmd_special(pmd_t pmd) +{ + return 0; +} #endif #ifndef pmd_read_atomic From patchwork Mon Dec 7 11:31:08 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11955523 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-15.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id E75BAC1B087 for ; Mon, 7 Dec 2020 11:38:26 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id B44BB233CE for ; Mon, 7 Dec 2020 11:38:26 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727162AbgLGLfL (ORCPT ); Mon, 7 Dec 2020 06:35:11 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40936 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727149AbgLGLfK (ORCPT ); Mon, 7 Dec 2020 06:35:10 -0500 Received: from mail-pf1-x432.google.com 
From: Yulei Zhang <yuleixzhang@tencent.com>
Subject: [RFC V2 15/37] mm: add pmd_special() check for pmd_trans_huge_lock()
Date: Mon, 7 Dec 2020 19:31:08 +0800
Message-Id: <789d8a9a23887c20e4966b6e1c9b52a320ab87af.1607332046.git.yuleixzhang@tencent.com>

Now that a dmem-pmd is distinguished from a thp-pmd, add a pmd_special() check so that pmd_trans_huge_lock() can fetch the ptl for a dmem huge pmd and treat it as a stable pmd.
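To see why the lock helper needs both checks, the standalone model below mirrors the predicate logic from patches 12-15: a dmem pmd carries _PAGE_SPECIAL|_PAGE_DMEM and is deliberately excluded from pmd_trans_huge(), so the two tests never both fire. The bit positions are stand-ins, not the real x86 definitions.

/*
 * Minimal userspace model of the pmd flag predicates used by this series.
 * The bit positions are illustrative; only the relationships matter here.
 */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_PSE	(1ULL << 7)	/* huge mapping */
#define PAGE_SPECIAL	(1ULL << 9)
#define PAGE_DMEM	(1ULL << 57)
#define PAGE_DEVMAP	(1ULL << 58)

/* thp-pmd: PSE set, but neither DEVMAP nor DMEM (mirrors the patched check) */
static int pmd_trans_huge(uint64_t pmd)
{
	return (pmd & (PAGE_PSE | PAGE_DEVMAP | PAGE_DMEM)) == PAGE_PSE;
}

/* dmem-pmd: SPECIAL and DMEM both set */
static int pmd_special(uint64_t pmd)
{
	return (pmd & (PAGE_SPECIAL | PAGE_DMEM)) == (PAGE_SPECIAL | PAGE_DMEM);
}

int main(void)
{
	uint64_t thp  = PAGE_PSE;
	uint64_t dmem = PAGE_PSE | PAGE_SPECIAL | PAGE_DMEM;

	/*
	 * The predicates are mutually exclusive, so a lock helper such as
	 * pmd_trans_huge_lock() must test both to treat a dmem pmd as stable.
	 */
	assert(pmd_trans_huge(thp) && !pmd_special(thp));
	assert(!pmd_trans_huge(dmem) && pmd_special(dmem));
	printf("thp-pmd and dmem-pmd are told apart as expected\n");
	return 0;
}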
Signed-off-by: Chen Zhuo Signed-off-by: Yulei Zhang --- include/linux/huge_mm.h | 3 ++- mm/huge_memory.c | 2 +- 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 0365aa9..2514b90 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -242,7 +242,8 @@ static inline int is_swap_pmd(pmd_t pmd) static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma) { - if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) + if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) + || pmd_devmap(*pmd) || pmd_special(*pmd)) return __pmd_trans_huge_lock(pmd, vma); else return NULL; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 9474dbc..31f9e83 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1890,7 +1890,7 @@ spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma) spinlock_t *ptl; ptl = pmd_lock(vma->vm_mm, pmd); if (likely(is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || - pmd_devmap(*pmd))) + pmd_devmap(*pmd) || pmd_special(*pmd))) return ptl; spin_unlock(ptl); return NULL; From patchwork Mon Dec 7 11:31:09 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11955517 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-15.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 382C3C2BB40 for ; Mon, 7 Dec 2020 11:37:57 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 0A78A233A0 for ; Mon, 7 Dec 2020 11:37:57 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727218AbgLGLfR (ORCPT ); Mon, 7 Dec 2020 06:35:17 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40948 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726769AbgLGLfO (ORCPT ); Mon, 7 Dec 2020 06:35:14 -0500 Received: from mail-pf1-x443.google.com (mail-pf1-x443.google.com [IPv6:2607:f8b0:4864:20::443]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id D4D76C0613D2; Mon, 7 Dec 2020 03:34:33 -0800 (PST) Received: by mail-pf1-x443.google.com with SMTP id s21so9601217pfu.13; Mon, 07 Dec 2020 03:34:33 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=TR3cC+AByvGn3JshaPXKc+3rxof2L0xMPbaiJBaNE0A=; b=PwcAOe4heCxF7XTuomjeUtwKezkdJmM0HmTX1yw4AeADa+3Z5jYH2gnncSNx6tF/dK nqc3OY3XklNb6fsWrMNV9SLhYauoxenHh7jF/LUEVNXitOpU59OZoN9JC+QCR46jU/Ns sBCj+XDvRx916ayOQcnxhHqZ/yh9ih/c9sVknDZkZs/LBEO/v5jpO2wZ1oEm3FfjX0hE TbWTBt9N8W/HGCkWK2IvBBhkklXYtf5Ke1t4gjU8zkkB1O5YL8vGHwUoJwXrPE3CU4Q8 1EfVWBeVZsU6g44z81Jf/b0p33RSG/Hwm+qr6c5NqBuzkeVYX26s7OBZzLc31WX/Gbiy jVjg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; 
From: Yulei Zhang <yuleixzhang@tencent.com>
Subject: [RFC V2 16/37] dmemfs: introduce ->split() to dmemfs_vm_ops
Date: Mon, 7 Dec 2020 19:31:09 +0800
Message-Id: <6b3c166a8d5827a1f6f2a33d85feae1c1633a45d.1607332046.git.yuleixzhang@tencent.com>

A ->split() handler is required so that __split_vma() can adjust the vma; a munmap() that would punch a hole not aligned to the dmemfs page size must be forbidden.
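The handler added in the diff that follows boils down to a power-of-two alignment test on the split address. The sketch below shows that test in isolation; the 2 MiB page size and function name are assumptions for the example, not the dmemfs code itself.

/* Standalone illustration of the split alignment check. */
#include <errno.h>
#include <stdio.h>

/* Assume a 2 MiB dmem page size for the example. */
#define DMEM_PAGE_SIZE	(2UL << 20)

static int dmemfs_split_check(unsigned long addr)
{
	/* Reject any split address that falls inside a dmem page. */
	if (addr & (DMEM_PAGE_SIZE - 1))
		return -EINVAL;
	return 0;
}

int main(void)
{
	printf("split at 0x400000: %d\n", dmemfs_split_check(0x400000)); /* aligned -> 0 */
	printf("split at 0x401000: %d\n", dmemfs_split_check(0x401000)); /* unaligned -> -EINVAL */
	return 0;
}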
Signed-off-by: Chen Zhuo Signed-off-by: Yulei Zhang --- fs/dmemfs/inode.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/fs/dmemfs/inode.c b/fs/dmemfs/inode.c index 443f2e1..ab6a492 100644 --- a/fs/dmemfs/inode.c +++ b/fs/dmemfs/inode.c @@ -450,6 +450,13 @@ static bool check_vma_access(struct vm_area_struct *vma, int write) return len; } +static int dmemfs_split(struct vm_area_struct *vma, unsigned long addr) +{ + if (addr & (dmem_page_size(file_inode(vma->vm_file)) - 1)) + return -EINVAL; + return 0; +} + static vm_fault_t dmemfs_fault(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; @@ -484,6 +491,7 @@ static unsigned long dmemfs_pagesize(struct vm_area_struct *vma) } static const struct vm_operations_struct dmemfs_vm_ops = { + .split = dmemfs_split, .fault = dmemfs_fault, .pagesize = dmemfs_pagesize, .access = dmemfs_access_dmem, From patchwork Mon Dec 7 11:31:10 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11955515 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-15.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id E373FC1B0D9 for ; Mon, 7 Dec 2020 11:37:56 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id B8BB5233A1 for ; Mon, 7 Dec 2020 11:37:56 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727281AbgLGLfV (ORCPT ); Mon, 7 Dec 2020 06:35:21 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40958 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727149AbgLGLfR (ORCPT ); Mon, 7 Dec 2020 06:35:17 -0500 Received: from mail-pf1-x42a.google.com (mail-pf1-x42a.google.com [IPv6:2607:f8b0:4864:20::42a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 767D3C0613D1; Mon, 7 Dec 2020 03:34:37 -0800 (PST) Received: by mail-pf1-x42a.google.com with SMTP id d2so5721926pfq.5; Mon, 07 Dec 2020 03:34:37 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=L3L3i0S6/nYZYxaW9mSnagt8eN+myZy0Uq4CS77lTAc=; b=M9bgqlb46kjnGVIXaRr/XrGDP6bC/rLRPYv7m3v/0Tiu7JDV84U0SZvFNZ7h9g0mEN bBxITDw2c/WSRAvLCXDehd3vLLN3xOZRGtx8xAdqFPlUtgf0NDSnl/cm4DQU6HYdhqnj XIE3zBNV6Shsi00GD7rhQ9OGpYBEFRA6EJNcBupMSM4kBJ3vqmJbFJ17uokCJLONmFvj j4aheVme/XXOMbJUFyX8F9Y/uo/0tsBnYn38LCXo3jZ2xdS+c3YRz1/eIOr1BvC5G/96 NWHaYdsbBaoLmjFo6L3oHM36BEKSsWUTu6v9jm+USREYts6euWqrB7FGFAV22r22MwIS C6Gg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=L3L3i0S6/nYZYxaW9mSnagt8eN+myZy0Uq4CS77lTAc=; b=PIpOLtjXtrNPln++qskdsOI2faEjIOAYIbcM60aI4+bzQ5uV3y9FTQvj8z/JlrFB1v 7Dn56AMP+MY7hFCQvhhJ16CdoJFybYFW/ZsDphqvk/UVxSnj7Sc3+ZOIpFNwnawq+pKe bPWRyFljNwBiGYVchd5bB6cQCYQt7HIBoPll2B9Xtp/KtGQZen+hxDgzbT8vdpG/I4OC 
9PM3PxDrTQBgTNKVrfkwoI/v3BFrgKXsbpZDn2lKah8DXQl+0follkBySq9IQC0MrDeI uvuwtKrvezj0UF36hvGuzL6q7aXaGSA6ODZx0wcclYxUw521r3hIBcsu06fcsqxqatWR hnNg== X-Gm-Message-State: AOAM5323SpCsAN/2yxKnJxaLH9uAulB+zMY8sZpGn5w0pqyMsYlLtjue EJXqIAbPawFXlDZ9zqN7jj4= X-Google-Smtp-Source: ABdhPJxAoj+fbea/saUCeXAmigtbj1zZV+shVlXg+5snAfgXq3xCMvRxVvmTAbjuFi907z6lXAIqyA== X-Received: by 2002:a63:494f:: with SMTP id y15mr17974005pgk.364.1607340877090; Mon, 07 Dec 2020 03:34:37 -0800 (PST) Received: from localhost.localdomain ([203.205.141.39]) by smtp.gmail.com with ESMTPSA id d4sm14219822pfo.127.2020.12.07.03.34.33 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Mon, 07 Dec 2020 03:34:36 -0800 (PST) From: yulei.kernel@gmail.com X-Google-Original-From: yuleixzhang@tencent.com To: linux-mm@kvack.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com Cc: joao.m.martins@oracle.com, rdunlap@infradead.org, sean.j.christopherson@intel.com, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang , Chen Zhuo Subject: [RFC V2 17/37] mm, dmemfs: support unmap_page_range() for dmemfs pmd Date: Mon, 7 Dec 2020 19:31:10 +0800 Message-Id: X-Mailer: git-send-email 2.28.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org From: Yulei Zhang It is required by munmap() for dmemfs mapping. Signed-off-by: Chen Zhuo Signed-off-by: Yulei Zhang --- mm/huge_memory.c | 2 ++ mm/memory.c | 8 +++++--- 2 files changed, 7 insertions(+), 3 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 31f9e83..2a818ec 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1664,6 +1664,8 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, spin_unlock(ptl); if (is_huge_zero_pmd(orig_pmd)) tlb_remove_page_size(tlb, pmd_page(orig_pmd), HPAGE_PMD_SIZE); + } else if (pmd_special(orig_pmd)) { + spin_unlock(ptl); } else if (is_huge_zero_pmd(orig_pmd)) { zap_deposited_table(tlb->mm, pmd); spin_unlock(ptl); diff --git a/mm/memory.c b/mm/memory.c index c48f8df..6b60981 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1338,10 +1338,12 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb, pmd = pmd_offset(pud, addr); do { next = pmd_addr_end(addr, end); - if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) { - if (next - addr != HPAGE_PMD_SIZE) + if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || + pmd_devmap(*pmd) || pmd_special(*pmd)) { + if (next - addr != HPAGE_PMD_SIZE) { + VM_BUG_ON(pmd_special(*pmd)); __split_huge_pmd(vma, pmd, addr, false, NULL); - else if (zap_huge_pmd(tlb, vma, pmd, addr)) + } else if (zap_huge_pmd(tlb, vma, pmd, addr)) goto next; /* fall through */ } From patchwork Mon Dec 7 11:31:11 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11955513 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-15.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org 
Subject: [RFC V2 18/37] mm: follow_pmd_mask() for dmem huge pmd
Date: Mon, 7 Dec 2020 19:31:11 +0800
Message-Id: <1401155e1db8221b892fb935204ad2d358c2808f.1607332046.git.yuleixzhang@tencent.com>
From:
Yulei Zhang While follow_pmd_mask(), dmem huge pmd should be recognized and return error pointer of '-EEXIST' to indicate that proper page table entry exists in pmd special but no corresponding struct page, because dmem page means non struct page backend. We update pmd if foll_flags takes FOLL_TOUCH. Signed-off-by: Chen Zhuo Signed-off-by: Yulei Zhang --- mm/gup.c | 42 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 42 insertions(+) diff --git a/mm/gup.c b/mm/gup.c index 98eb8e6..ad1aede 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -387,6 +387,42 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address, return -EEXIST; } +static struct page * +follow_special_pmd(struct vm_area_struct *vma, unsigned long address, + pmd_t *pmd, unsigned int flags) +{ + spinlock_t *ptl; + + if ((flags & FOLL_DUMP) && is_huge_zero_pmd(*pmd)) + /* Avoid special (like zero) pages in core dumps */ + return ERR_PTR(-EFAULT); + + /* No page to get reference */ + if (flags & FOLL_GET) + return ERR_PTR(-EFAULT); + + if (flags & FOLL_TOUCH) { + pmd_t _pmd; + + ptl = pmd_lock(vma->vm_mm, pmd); + if (!pmd_special(*pmd)) { + spin_unlock(ptl); + return NULL; + } + _pmd = pmd_mkyoung(*pmd); + if (flags & FOLL_WRITE) + _pmd = pmd_mkdirty(_pmd); + if (pmdp_set_access_flags(vma, address & HPAGE_PMD_MASK, + pmd, _pmd, + flags & FOLL_WRITE)) + update_mmu_cache_pmd(vma, address, pmd); + spin_unlock(ptl); + } + + /* Proper page table entry exists, but no corresponding struct page */ + return ERR_PTR(-EEXIST); +} + /* * FOLL_FORCE can write to even unwritable pte's, but only * after we've gone through a COW cycle and they are dirty. @@ -571,6 +607,12 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma, return page; return no_page_table(vma, flags); } + if (pmd_special(*pmd)) { + page = follow_special_pmd(vma, address, pmd, flags); + if (page) + return page; + return no_page_table(vma, flags); + } if (is_hugepd(__hugepd(pmd_val(pmdval)))) { page = follow_huge_pd(vma, address, __hugepd(pmd_val(pmdval)), flags, From patchwork Mon Dec 7 11:31:12 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11955509 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-15.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 77D6BC2BBCD for ; Mon, 7 Dec 2020 11:37:30 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 4DBB82333F for ; Mon, 7 Dec 2020 11:37:30 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727377AbgLGLf0 (ORCPT ); Mon, 7 Dec 2020 06:35:26 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40980 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727303AbgLGLfZ (ORCPT ); Mon, 7 Dec 2020 06:35:25 -0500 Received: from mail-pf1-x434.google.com (mail-pf1-x434.google.com [IPv6:2607:f8b0:4864:20::434]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id D63C7C061A4F; Mon, 7 Dec 2020 
03:34:44 -0800 (PST) Received: by mail-pf1-x434.google.com with SMTP id 131so9600683pfb.9; Mon, 07 Dec 2020 03:34:44 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=PsVgOpQIgSKKX9sKyFMtyV6DCXy6X7BZeXZh0AyMkHE=; b=QMmaSsp/kPkYAOoJQsM9EoXqVH9lanjFe06YoMyS63YE92kVOJCcMfFXgEA7yjn87P y28+hU0OuDWvBX7SY2rxnQlWNrE6D+KzZK9M4ko1y+IRrfeYC8ZoXYQ5j/FHutEdRnUA CFnWcAyRmrIXQyQXqs6RE+G5CKAhGupvDHeYaQpkizYK/LR7w2A0MDaQ4i13inwSb1f4 eMV/GFBnZ9S+DZABkEr6AOMkeB8HeEP6uKXvczuBshAKSC1DzmYlsaa6DQXz6YZzMVS5 wqQYorLHoqsJF9fPQ4cEzpX39pO4YzMWouHhku6ju81R+o5U+fSNpP/dEc35m36/1jNg h5Jg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=PsVgOpQIgSKKX9sKyFMtyV6DCXy6X7BZeXZh0AyMkHE=; b=SkbDFoSSJul67cdFkum1nIGzKzra0pYFXPX5CvbicepGJqTO/LUgryv0srH3CfsTkJ 03rYXVSabe6c3UN4vdAO9b81+hnVEGIONvQLt2qvNSplyQrV+s8eHAy4qxurrxMD3FUl iweDGmK0oaMAyKGrfGUH+WnxKNzF7nbWnGxvUzokuCLP6mBCht29ZQFlJpblLiWOKYj7 d2tgqZobVrD7EPusGsKhpY9O32LeXmZ0xqq2vCBFCjdW1b0f1lF58SWFPZV3MZdxtifA PHlNhLddj7Cl3v3u1IPK0/yrFRIZDj4ucOHtyqV7aaBolVyLIvagM8fdGCWJYEwTOOP1 GkeQ== X-Gm-Message-State: AOAM5334Ic4Wt9FENMjvCOs7mAxy0Lm+N2tYvsk3EIAySto1uwecgb+W xyatGRLORSu4EG+wSrZbiF0= X-Google-Smtp-Source: ABdhPJynrUsQxj48xNiMU/huEKvcNjt/ObGmP+cORAqqJJ7WrkW6MqDOQEZH+sv1gpYZpKvKfsdZ0g== X-Received: by 2002:a62:6d06:0:b029:19d:9728:2b71 with SMTP id i6-20020a626d060000b029019d97282b71mr15195431pfc.69.1607340884475; Mon, 07 Dec 2020 03:34:44 -0800 (PST) Received: from localhost.localdomain ([203.205.141.39]) by smtp.gmail.com with ESMTPSA id d4sm14219822pfo.127.2020.12.07.03.34.41 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Mon, 07 Dec 2020 03:34:44 -0800 (PST) From: yulei.kernel@gmail.com X-Google-Original-From: yuleixzhang@tencent.com To: linux-mm@kvack.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com Cc: joao.m.martins@oracle.com, rdunlap@infradead.org, sean.j.christopherson@intel.com, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang , Chen Zhuo Subject: [RFC V2 19/37] mm: gup_huge_pmd() for dmem huge pmd Date: Mon, 7 Dec 2020 19:31:12 +0800 Message-Id: <1a8eaaf72af4bd98c8fa1a90d36a64612f7c14b0.1607332046.git.yuleixzhang@tencent.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org From: Yulei Zhang Add pmd_special() check in gup_huge_pmd() to support dmem huge pmd. GUP will return zero if enconter dmem page, and we could handle it outside GUP routine. Signed-off-by: Chen Zhuo Signed-off-by: Yulei Zhang --- mm/gup.c | 6 +++++- mm/pagewalk.c | 2 +- 2 files changed, 6 insertions(+), 2 deletions(-) diff --git a/mm/gup.c b/mm/gup.c index ad1aede..47c8197 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -2470,6 +2470,10 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr, if (!pmd_access_permitted(orig, flags & FOLL_WRITE)) return 0; + /* Bypass dmem huge pmd. It will be handled in outside routine. 
*/ + if (pmd_special(orig)) + return 0; + if (pmd_devmap(orig)) { if (unlikely(flags & FOLL_LONGTERM)) return 0; @@ -2572,7 +2576,7 @@ static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned lo return 0; if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd) || - pmd_devmap(pmd))) { + pmd_devmap(pmd) || pmd_special(pmd))) { /* * NUMA hinting faults need to be handled in the GUP * slowpath for accounting purposes and so that they diff --git a/mm/pagewalk.c b/mm/pagewalk.c index e81640d..e7c4575 100644 --- a/mm/pagewalk.c +++ b/mm/pagewalk.c @@ -71,7 +71,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end, do { again: next = pmd_addr_end(addr, end); - if (pmd_none(*pmd) || (!walk->vma && !walk->no_vma)) { + if (pmd_none(*pmd) || (!walk->vma && !walk->no_vma) || pmd_special(*pmd)) { if (ops->pte_hole) err = ops->pte_hole(addr, next, depth, walk); if (err) From patchwork Mon Dec 7 11:31:13 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11955505 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-15.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3F942C2BB40 for ; Mon, 7 Dec 2020 11:37:30 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 04CE0233A0 for ; Mon, 7 Dec 2020 11:37:29 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727378AbgLGLf3 (ORCPT ); Mon, 7 Dec 2020 06:35:29 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40992 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727392AbgLGLf2 (ORCPT ); Mon, 7 Dec 2020 06:35:28 -0500 Received: from mail-pg1-x542.google.com (mail-pg1-x542.google.com [IPv6:2607:f8b0:4864:20::542]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 4080FC061A51; Mon, 7 Dec 2020 03:34:48 -0800 (PST) Received: by mail-pg1-x542.google.com with SMTP id w16so8640487pga.9; Mon, 07 Dec 2020 03:34:48 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=irx3d3NJeQeYfZPwYnMd1dXXT7HLdVgOj1onbLngswY=; b=eJYAcqAfXe8LOuaHKdKllKlDVNqxancjE8M5qAXrhETqb6j0v9x0H6NmUSzw8StI14 J2rM/AstiKHgOTM5AVfKtlpjtk+Fy+75hlmDPxd9jG4tEq1SyKQGWIYTm7uq/2uxcFlg W65vIQCLcSGlLabBaPbAuQhsHwJbWDWT8I5hWT/k/rJ7sxp6PR9YsdF4n62gC431AOpJ aIaVvanL5902eANwLRH1qq5hk7Snj3uV5PeZ7O6x1IM11CDxzRxpQlAinRhXhnHtReTC Trs4VME+idkpV6820spwytsexIK78pl4wM0D44rVyvONHPP9rysLFUxoF40Jm3dYeJDg jyOQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=irx3d3NJeQeYfZPwYnMd1dXXT7HLdVgOj1onbLngswY=; b=OdS9NvdLbukiIbZilS71U0WElWLNygdBtboFgvJfrDYM8xZl0BRw2MXFeHcAiLfKkr Pse3wJVMiDrZXI3RxpAKqSWzIeYN/zRDc4TS9e81e1wzFUswNugfophM9HLLmiLGSI+g 
WiiyOUrZztL1TyNLCLfCPrianaLWW93aARzqXcA3q/WdY5yR6Vyb5VJA0fB6dkdTx4Ho /JBtoRnsN8o2z8VrsbSvXX8BIFTotp96iUKkeBIH68h5et51A5A6oYeOeVvigWcq6vZ9 bV31G8M3xD3sEE8qRaR49b1DcsT5YZnN+1SQ3aAwwlojlwaAgi+OczCsNkGlIEd0bkF+ yUTA== X-Gm-Message-State: AOAM531kz5LLAVjNZITm/fNuA62p7vcGweqNvehHilS/CbcR/tZai42G yD6ydOrcR5lD0y8/r3qmZPA= X-Google-Smtp-Source: ABdhPJy7T+zOWSaaaUAy9bCSHgKA4JkL2mqiQ0XfQp/lTQid14KX+alS2W1IiCRqphNjeXsYneYbLA== X-Received: by 2002:a17:902:7b97:b029:d8:ec6e:5c28 with SMTP id w23-20020a1709027b97b02900d8ec6e5c28mr15836888pll.40.1607340887900; Mon, 07 Dec 2020 03:34:47 -0800 (PST) Received: from localhost.localdomain ([203.205.141.39]) by smtp.gmail.com with ESMTPSA id d4sm14219822pfo.127.2020.12.07.03.34.44 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Mon, 07 Dec 2020 03:34:47 -0800 (PST) From: yulei.kernel@gmail.com X-Google-Original-From: yuleixzhang@tencent.com To: linux-mm@kvack.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com Cc: joao.m.martins@oracle.com, rdunlap@infradead.org, sean.j.christopherson@intel.com, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang , Chen Zhuo Subject: [RFC V2 20/37] mm: support dmem huge pmd for vmf_insert_pfn_pmd() Date: Mon, 7 Dec 2020 19:31:13 +0800 Message-Id: X-Mailer: git-send-email 2.28.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org From: Yulei Zhang Since vmf_insert_pfn_pmd will BUG_ON non-pmd-devmap, we make pfn dmem pass the check. Dmem huge pmd will be marked with _PAGE_SPECIAL and _PAGE_DMEM, so that follow_pfn() could recognize it. Signed-off-by: Chen Zhuo Signed-off-by: Yulei Zhang --- mm/huge_memory.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 2a818ec..6e52d57 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -781,6 +781,8 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr, entry = pmd_mkhuge(pfn_t_pmd(pfn, prot)); if (pfn_t_devmap(pfn)) entry = pmd_mkdevmap(entry); + else if (pfn_t_dmem(pfn)) + entry = pmd_mkdmem(entry); if (write) { entry = pmd_mkyoung(pmd_mkdirty(entry)); entry = maybe_pmd_mkwrite(entry, vma); @@ -827,7 +829,7 @@ vm_fault_t vmf_insert_pfn_pmd_prot(struct vm_fault *vmf, pfn_t pfn, * can't support a 'special' bit. 
*/ BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) && - !pfn_t_devmap(pfn)); + !pfn_t_devmap(pfn) && !pfn_t_dmem(pfn)); BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) == (VM_PFNMAP|VM_MIXEDMAP)); BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags)); From patchwork Mon Dec 7 11:31:14 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11955493 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-15.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id ACF63C1B0D9 for ; Mon, 7 Dec 2020 11:37:01 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 82AD7233A0 for ; Mon, 7 Dec 2020 11:37:01 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727480AbgLGLfj (ORCPT ); Mon, 7 Dec 2020 06:35:39 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41004 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727466AbgLGLfh (ORCPT ); Mon, 7 Dec 2020 06:35:37 -0500 Received: from mail-pg1-x529.google.com (mail-pg1-x529.google.com [IPv6:2607:f8b0:4864:20::529]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 8B298C061A54; Mon, 7 Dec 2020 03:34:51 -0800 (PST) Received: by mail-pg1-x529.google.com with SMTP id w4so8627283pgg.13; Mon, 07 Dec 2020 03:34:51 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=DaKSoEd8oO/EoBIYBO4UbeyIy7f7hbBSPRHqMQ3AnvY=; b=PzhKbqA5fZx1L/DdeOWbYK9XMQgh6O7ZLstwldRGSShPdyufij9rEkyE6yyPtqiQFR DvWZPV6HXWx47aXdj0c5QGrJYx5jyAb/hJu6ZyCLoid/KOSX4lnYuS1fkyC5qERNn/or fi52i7cA0Ydltn8kGuWGFLJllgTdoSaKxXa27xZA7ACXlEmO/koMkREMuXdB83dzkDs3 bplK/YxucWjx5cD+h5V/dR0hD0kyXDZPv5eYvSZXBUPl8fSplms6k6vGbok9ayTpD/3z aXsvfT7CwDAq3W8zodGBCvxkyjKmS/58WaE/6abU2h1kr6g7+ePfVSLGuuKFpsJcEC4S +BmA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=DaKSoEd8oO/EoBIYBO4UbeyIy7f7hbBSPRHqMQ3AnvY=; b=oy2liVx1AWp8wIDUYNAiD4J1aMSGk9A+3gZvYM1ePUdxo5wPinjgZEZRvUUxQN51Ml I41Zk2Elu45GaJJtbiGTkXZWmOV5CSO/e0XSK+7XYOHvD/qWTRiiyRN0tqzsiZ/nsAu9 0f1maZnmpUzNzfAoj1ippZ0+Fz8mkE80JF4+Q5pXLyE2fBooJpwCqi1r+bKh0ZG4YB4t xZEhIwzqU8I+FgxXdM/vNkb73NaMl5+xLGyBprAu4Zwi+zJeK2BodgPEIzlPUC6YnF1v 6zTwnh6VifJWbprvxNS1LrsK4N95P2ba27S5OYaSAbgs0RNUrFcAYpDPO3g60EmLiCKn 2R1g== X-Gm-Message-State: AOAM530q86wxVOLN5FBcKztpkrHTrkfud/O6txoxYpoMW707pzqLqmDk NLFViHXEfdswHPctycchhBA= X-Google-Smtp-Source: ABdhPJyqqW0fQyetyIn3XyHRcLkK2zOeQJNTeAqGI4VkfBmrNDEhNpiPMRzTChGwx86K9TucHV5VBg== X-Received: by 2002:a17:902:7144:b029:da:7268:d730 with SMTP id u4-20020a1709027144b02900da7268d730mr15412723plm.20.1607340891184; Mon, 07 Dec 2020 03:34:51 -0800 (PST) Received: from localhost.localdomain ([203.205.141.39]) by smtp.gmail.com with 
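For reference, the pmd encoding used by this patch can be modelled in plain user-space C. This is only a sketch: _PAGE_SPECIAL matches the x86 software bit, but the _PAGE_DMEM and _PAGE_DEVMAP positions below are assumed placeholders, not the series' actual definitions.

/* User-space model of pmd_mkdmem()/pmd_special(): a dmem huge PMD carries
 * both _PAGE_SPECIAL and _PAGE_DMEM, which is how follow_pfn() and GUP can
 * tell it apart from a devmap (DAX) huge PMD. Bit positions illustrative. */
#include <stdint.h>
#include <stdio.h>

#define _PAGE_SPECIAL (1ULL << 9)   /* x86 software bit */
#define _PAGE_DEVMAP  (1ULL << 47)  /* assumed */
#define _PAGE_DMEM    (1ULL << 58)  /* assumed spare software bit */

typedef uint64_t pmd_t;

static pmd_t pmd_mkdmem(pmd_t pmd)
{
	return pmd | _PAGE_SPECIAL | _PAGE_DMEM;
}

static int pmd_special(pmd_t pmd)
{
	/* Both bits must be set, exactly as in the patched check. */
	return (pmd & (_PAGE_SPECIAL | _PAGE_DMEM)) ==
	       (_PAGE_SPECIAL | _PAGE_DMEM);
}

int main(void)
{
	pmd_t devmap = _PAGE_DEVMAP;   /* e.g. a DAX mapping */
	pmd_t dmem = pmd_mkdmem(0);    /* a dmemfs mapping */

	printf("devmap special? %d\n", pmd_special(devmap)); /* 0 */
	printf("dmem   special? %d\n", pmd_special(dmem));   /* 1 */
	return 0;
}

Requiring both bits keeps ordinary pte_special()-style mappings, which set _PAGE_SPECIAL alone, from being misread as dmem.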
ESMTPSA id d4sm14219822pfo.127.2020.12.07.03.34.48 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Mon, 07 Dec 2020 03:34:50 -0800 (PST) From: yulei.kernel@gmail.com X-Google-Original-From: yuleixzhang@tencent.com To: linux-mm@kvack.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com Cc: joao.m.martins@oracle.com, rdunlap@infradead.org, sean.j.christopherson@intel.com, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang , Chen Zhuo Subject: [RFC V2 21/37] mm: support dmem huge pmd for follow_pfn() Date: Mon, 7 Dec 2020 19:31:14 +0800 Message-Id: X-Mailer: git-send-email 2.28.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org From: Yulei Zhang follow_pfn() will get pfn of pmd if huge pmd is encountered. Signed-off-by: Chen Zhuo Signed-off-by: Yulei Zhang --- mm/memory.c | 14 +++++++++++--- 1 file changed, 11 insertions(+), 3 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 6b60981..abb9148 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4807,15 +4807,23 @@ int follow_pfn(struct vm_area_struct *vma, unsigned long address, int ret = -EINVAL; spinlock_t *ptl; pte_t *ptep; + pmd_t *pmdp = NULL; if (!(vma->vm_flags & (VM_IO | VM_PFNMAP))) return ret; - ret = follow_pte(vma->vm_mm, address, &ptep, &ptl); + ret = follow_pte_pmd(vma->vm_mm, address, NULL, &ptep, &pmdp, &ptl); if (ret) return ret; - *pfn = pte_pfn(*ptep); - pte_unmap_unlock(ptep, ptl); + + if (pmdp) { + *pfn = pmd_pfn(*pmdp) + ((address & ~PMD_MASK) >> PAGE_SHIFT); + spin_unlock(ptl); + } else { + *pfn = pte_pfn(*ptep); + pte_unmap_unlock(ptep, ptl); + } + return 0; } EXPORT_SYMBOL(follow_pfn); From patchwork Mon Dec 7 11:31:15 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11955353 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-15.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id B7043C2BBCD for ; Mon, 7 Dec 2020 11:35:53 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 8F0B8233FC for ; Mon, 7 Dec 2020 11:35:53 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727530AbgLGLft (ORCPT ); Mon, 7 Dec 2020 06:35:49 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41034 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727466AbgLGLfo (ORCPT ); Mon, 7 Dec 2020 06:35:44 -0500 Received: from mail-pg1-x533.google.com (mail-pg1-x533.google.com [IPv6:2607:f8b0:4864:20::533]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 1C8F3C08E864; Mon, 7 Dec 2020 03:34:55 -0800 (PST) Received: by mail-pg1-x533.google.com with SMTP id o5so8647152pgm.10; Mon, 07 Dec 2020 03:34:55 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; 
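The offset arithmetic in the follow_pfn() hunk above is easy to check in isolation. Below is a stand-alone sketch using assumed x86-64 constants (4K base pages, 2M PMDs); the names mirror the kernel macros but are redefined locally.

/* Stand-alone check of:
 *   *pfn = pmd_pfn(*pmdp) + ((address & ~PMD_MASK) >> PAGE_SHIFT)
 * i.e. the base pfn of the 2M page plus the 4K-page offset inside it. */
#include <stdio.h>

#define PAGE_SHIFT 12
#define PMD_SHIFT  21
#define PMD_MASK   (~((1UL << PMD_SHIFT) - 1))

int main(void)
{
	unsigned long pmd_base_pfn = 0x100000;        /* pfn of the 2M page head */
	unsigned long address      = 0x7f0000012345UL; /* faulting user address  */

	unsigned long pfn = pmd_base_pfn + ((address & ~PMD_MASK) >> PAGE_SHIFT);

	printf("pfn = %#lx\n", pfn); /* 0x100000 + 0x12 = 0x100012 */
	return 0;
}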
s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=cfW0rkOgMIuxNIUMjXUU1M20WeHJvF5wHx5SDbaa7WE=; b=tAKAgseG2CCONQ+JBLfVPuz/DEVm6e6uJwCvBS1RREKrlZrlOUdzYsuLEUhueHfOFF uFV+57rP3yHKObWXA15tn3JY9Y2Mp7P0iPIKN8JWG9DeMFVnD03wWM3MnRRZiJQT6AF2 WR2vxkVxf0MNXxhaYUqnCBkgNMZXajnPprZpauiW0v4cZtIPknD9A9IWTbwA3N/nERtu oiMqPUyiyheuzpFUWKvr/tZb1S1TJE2PoPMHS+zlAU086PYQuSRhzWuPFYiB7b1Acfcl yxBbTTVnn5ROFZOGNlweU2k6Hu7UN+pZY6i4XQztkumh8Gv00oB9pvKIHKFEHgAYZYi/ 4JrQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=cfW0rkOgMIuxNIUMjXUU1M20WeHJvF5wHx5SDbaa7WE=; b=FxBKjWXmRnz7ifCQVUGXFM8hPwkNTYIpfVWyk1OHP0/5FEAvpcMVZkKQfVWd4KBTa8 DcRoOqvkCkYfG6cW6TixeyrQ5+iA/4/xt3Pb0nQTrgdkxpMMMUJAjZZpZMk56SQrlYxN 3myAbITwHeVwdr4fKPhZVbF5MUhSmhgs7fJ/zBG6k1d9mqDiH9CJiRRJQwaABiVXQp55 ryhTWEADuVMkeARucipjmKwgS/xUt+90Y5hahGClLWJWOwsDU4ef75OHDXzICRdPkW8I mgHP3IqO96uSoG/s8HdsQHsTT+Y9sQBzscImiOjqwBbBJY9IKMBbUHnr1CI6TUEylA27 VKTQ== X-Gm-Message-State: AOAM530T2i+KOeU0kSYcF+BBEAAf6JGDs25AfptPdOLO0mKZpc+0hIE+ roXbAv0duBIOgN08Fm7B7zw= X-Google-Smtp-Source: ABdhPJwFz1981DhHI3jR9VGpOCOSaq+Ig5CXILTKJ1ERZg74A7VBmwCCkjGQ9zkn+U6XHw4ztmn59g== X-Received: by 2002:a17:902:5581:b029:da:a817:1753 with SMTP id g1-20020a1709025581b02900daa8171753mr15580441pli.76.1607340894707; Mon, 07 Dec 2020 03:34:54 -0800 (PST) Received: from localhost.localdomain ([203.205.141.39]) by smtp.gmail.com with ESMTPSA id d4sm14219822pfo.127.2020.12.07.03.34.51 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Mon, 07 Dec 2020 03:34:54 -0800 (PST) From: yulei.kernel@gmail.com X-Google-Original-From: yuleixzhang@tencent.com To: linux-mm@kvack.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com Cc: joao.m.martins@oracle.com, rdunlap@infradead.org, sean.j.christopherson@intel.com, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang , Chen Zhuo Subject: [RFC V2 22/37] kvm, x86: Distinguish dmemfs page from mmio page Date: Mon, 7 Dec 2020 19:31:15 +0800 Message-Id: X-Mailer: git-send-email 2.28.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org From: Yulei Zhang Dmem page is pfn invalid but not mmio, introduce API is_dmem_pfn() to distinguish that. 
Signed-off-by: Chen Zhuo Signed-off-by: Yulei Zhang --- arch/x86/kvm/mmu/mmu.c | 1 + include/linux/dmem.h | 7 +++++++ mm/dmem.c | 7 +++++++ 3 files changed, 15 insertions(+) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 5bb1939..394508f 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -43,6 +43,7 @@ #include #include #include +#include #include #include diff --git a/include/linux/dmem.h b/include/linux/dmem.h index 8682d63..59d3ef14 100644 --- a/include/linux/dmem.h +++ b/include/linux/dmem.h @@ -19,11 +19,18 @@ unsigned int try_max, unsigned int *result_nr); void dmem_free_pages(phys_addr_t addr, unsigned int dpages_nr); +bool is_dmem_pfn(unsigned long pfn); #define dmem_free_page(addr) dmem_free_pages(addr, 1) #else static inline int dmem_reserve_init(void) { return 0; } + +static inline bool is_dmem_pfn(unsigned long pfn) +{ + return 0; +} + #endif #endif /* _LINUX_DMEM_H */ diff --git a/mm/dmem.c b/mm/dmem.c index 2e61dbd..eb6df70 100644 --- a/mm/dmem.c +++ b/mm/dmem.c @@ -972,3 +972,10 @@ void dmem_free_pages(phys_addr_t addr, unsigned int dpages_nr) } EXPORT_SYMBOL(dmem_free_pages); +bool is_dmem_pfn(unsigned long pfn) +{ + struct dmem_node *dnode; + + return !!find_dmem_region(__pfn_to_phys(pfn), &dnode); +} +EXPORT_SYMBOL(is_dmem_pfn); From patchwork Mon Dec 7 11:31:16 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11955521 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-15.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id B189DC0018C for ; Mon, 7 Dec 2020 11:38:26 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 6EC45233A0 for ; Mon, 7 Dec 2020 11:38:26 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727260AbgLGLhz (ORCPT ); Mon, 7 Dec 2020 06:37:55 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40898 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727201AbgLGLfQ (ORCPT ); Mon, 7 Dec 2020 06:35:16 -0500 Received: from mail-pf1-x442.google.com (mail-pf1-x442.google.com [IPv6:2607:f8b0:4864:20::442]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C5D81C08E860; Mon, 7 Dec 2020 03:34:58 -0800 (PST) Received: by mail-pf1-x442.google.com with SMTP id c12so2530182pfo.10; Mon, 07 Dec 2020 03:34:58 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=3Q5MS2UXjUQqUjTpuPXnY5Lc/xhe4Tibwul6rnRwWL4=; b=nW7tHnXavpC5taczb2SrIKzWwWUMvoY4U2MTkN3vDPPis+BUGXKaGqPDCsIxEfquXK 6tmf6cEkdZTLCJdDXSh6MnNQCPBTV6kFB9HnhWkc91ozY0AAtLOOKkEMOqDBzYi4fqmn vakqjy46p1cUwbWXP/xi7QIAw9YzTSRnmgmnJrHoKXeAFo3V1L5iM3PHOBMX46KN5v3Z 34HgCx9eDmAlKhggnpFYM8ACTmm6p267BNhSJCIFO9+Der0ooC1Q9+xIQNrzo5WPJwHE 9owlGXBvga1yQ91B0OHtPxgNUCzR6oEuiJ6+Tvv4ZAklNhM3orxx6h4kzT/nmxGA3nGL lpuQ== X-Google-DKIM-Signature: v=1; 
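find_dmem_region() is defined elsewhere in the series, but the idea behind is_dmem_pfn() can be sketched in a few lines of user-space C: a pfn is "dmem" if its physical address falls inside one of the reserved regions. The region list below is made up for illustration.

/* Minimal model of is_dmem_pfn(): pfn -> phys, then a region lookup. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12

struct dmem_region { uint64_t start, end; };     /* [start, end) physical */

static const struct dmem_region regions[] = {
	{ 0x100000000ULL, 0x180000000ULL },      /* assumed 2G reservation at 4G */
};

static bool is_dmem_pfn(unsigned long pfn)
{
	uint64_t phys = (uint64_t)pfn << PAGE_SHIFT;

	for (size_t i = 0; i < sizeof(regions) / sizeof(regions[0]); i++)
		if (phys >= regions[i].start && phys < regions[i].end)
			return true;
	return false;
}

int main(void)
{
	printf("%d %d\n", is_dmem_pfn(0x100000), is_dmem_pfn(0x10000)); /* 1 0 */
	return 0;
}

This is what lets KVM treat a pfn that fails pfn_valid() as guest RAM backed by dmem rather than as MMIO.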
a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=3Q5MS2UXjUQqUjTpuPXnY5Lc/xhe4Tibwul6rnRwWL4=; b=AOlZ6/m/ax7skDD1F4UBxGDn8M4zj7sPPxO4GHdt7OdxRIld8VqbMZI60BFphocRwx oud65dcah1PW0xGUZZ133E3KUE9QVeMfai4XBoUqT93rsO3L85Ts2Cqc4HiDUlOfBRq2 gev7L+E8U8COcszGBE5y+fXUgMwfhbQe9UnZ2s5GCTOzPWuKLjQ7EtvNKtathNJ4ZACl JuYQefP7JYy2LtvlxC1Dy0YfZ6MlwqeMbxcg4MPoSXdvYbYjW+tZCGxcUw4e3Xxuk66K VjvRjbBezdyTlpeH20zeQx6vvOhlp1etxXW96aN6DDS9O/3fkjYKNE6J550l+kDdcH4R aaEw== X-Gm-Message-State: AOAM530U14SZL7oezcpOjXqTJGf/SDCdPRvaonzlo45rE72nlx/nbxzZ 8XYFWWqMR4eRjNOFSitazWk= X-Google-Smtp-Source: ABdhPJzvNvnKPK0JB4i5KGpQCvE8rYhnqcuhj31K/F7yMqFfiJ9DwwOo6y8xk/RvjaqCiew/ogf9Kg== X-Received: by 2002:a62:9205:0:b029:19d:bab0:ba17 with SMTP id o5-20020a6292050000b029019dbab0ba17mr15435970pfd.37.1607340898388; Mon, 07 Dec 2020 03:34:58 -0800 (PST) Received: from localhost.localdomain ([203.205.141.39]) by smtp.gmail.com with ESMTPSA id d4sm14219822pfo.127.2020.12.07.03.34.54 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Mon, 07 Dec 2020 03:34:57 -0800 (PST) From: yulei.kernel@gmail.com X-Google-Original-From: yuleixzhang@tencent.com To: linux-mm@kvack.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com Cc: joao.m.martins@oracle.com, rdunlap@infradead.org, sean.j.christopherson@intel.com, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang , Chen Zhuo Subject: [RFC V2 23/37] kvm, x86: introduce VM_DMEM for syscall support usage Date: Mon, 7 Dec 2020 19:31:16 +0800 Message-Id: X-Mailer: git-send-email 2.28.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org From: Yulei Zhang Currently dmemfs do not support memory readonly, so change_protection() will be disabled for dmemfs vma. Since vma->vm_flags could be changed to new flag in mprotect_fixup(), so we introduce a new vma flag VM_DMEM and check this flag in mprotect_fixup() to avoid changing vma->vm_flags. We also check it in vma_to_resize() to disable mremap() for dmemfs vma. 
Signed-off-by: Chen Zhuo Signed-off-by: Yulei Zhang --- fs/dmemfs/inode.c | 2 +- include/linux/mm.h | 7 +++++++ mm/gup.c | 7 +++++-- mm/mincore.c | 8 ++++++-- mm/mprotect.c | 5 ++++- mm/mremap.c | 3 +++ 6 files changed, 26 insertions(+), 6 deletions(-) diff --git a/fs/dmemfs/inode.c b/fs/dmemfs/inode.c index ab6a492..b165bd3 100644 --- a/fs/dmemfs/inode.c +++ b/fs/dmemfs/inode.c @@ -507,7 +507,7 @@ int dmemfs_file_mmap(struct file *file, struct vm_area_struct *vma) if (!(vma->vm_flags & VM_SHARED)) return -EINVAL; - vma->vm_flags |= VM_PFNMAP; + vma->vm_flags |= VM_PFNMAP | VM_DMEM | VM_IO; file_accessed(file); vma->vm_ops = &dmemfs_vm_ops; diff --git a/include/linux/mm.h b/include/linux/mm.h index db6ae4d..2f3135fe 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -311,6 +311,8 @@ int overcommit_policy_handler(struct ctl_table *, int, void *, size_t *, #define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4) #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */ +#define VM_DMEM BIT(38) /* Dmem page VM */ + #ifdef CONFIG_ARCH_HAS_PKEYS # define VM_PKEY_SHIFT VM_HIGH_ARCH_BIT_0 # define VM_PKEY_BIT0 VM_HIGH_ARCH_0 /* A protection key is a 4-bit value */ @@ -666,6 +668,11 @@ static inline bool vma_is_accessible(struct vm_area_struct *vma) return vma->vm_flags & VM_ACCESS_FLAGS; } +static inline bool vma_is_dmem(struct vm_area_struct *vma) +{ + return !!(vma->vm_flags & VM_DMEM); +} + #ifdef CONFIG_SHMEM /* * The vma_is_shmem is not inline because it is used only by slow diff --git a/mm/gup.c b/mm/gup.c index 47c8197..0ea9071 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -492,8 +492,11 @@ static struct page *follow_page_pte(struct vm_area_struct *vma, goto no_page; } else if (unlikely(!page)) { if (flags & FOLL_DUMP) { - /* Avoid special (like zero) pages in core dumps */ - page = ERR_PTR(-EFAULT); + if (vma_is_dmem(vma)) + page = ERR_PTR(-EEXIST); + else + /* Avoid special (like zero) pages in core dumps */ + page = ERR_PTR(-EFAULT); goto out; } diff --git a/mm/mincore.c b/mm/mincore.c index 02db1a8..f8d10e4 100644 --- a/mm/mincore.c +++ b/mm/mincore.c @@ -78,8 +78,12 @@ static int __mincore_unmapped_range(unsigned long addr, unsigned long end, pgoff_t pgoff; pgoff = linear_page_index(vma, addr); - for (i = 0; i < nr; i++, pgoff++) - vec[i] = mincore_page(vma->vm_file->f_mapping, pgoff); + for (i = 0; i < nr; i++, pgoff++) { + if (vma_is_dmem(vma)) + vec[i] = 1; + else + vec[i] = mincore_page(vma->vm_file->f_mapping, pgoff); + } } else { for (i = 0; i < nr; i++) vec[i] = 0; diff --git a/mm/mprotect.c b/mm/mprotect.c index 56c02be..b1650b5 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -236,7 +236,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, * for all the checks. */ if (!is_swap_pmd(*pmd) && !pmd_devmap(*pmd) && - pmd_none_or_clear_bad_unless_trans_huge(pmd)) + pmd_none_or_clear_bad_unless_trans_huge(pmd) && !pmd_special(*pmd)) goto next; /* invoke the mmu notifier if the pmd is populated */ @@ -412,6 +412,9 @@ static int prot_none_test(unsigned long addr, unsigned long next, return 0; } + if (vma_is_dmem(vma)) + return -EINVAL; + /* * Do PROT_NONE PFN permission checks here when we can still * bail out without undoing a lot of state. 
This is a rather diff --git a/mm/mremap.c b/mm/mremap.c index 138abba..598e681 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -482,6 +482,9 @@ static struct vm_area_struct *vma_to_resize(unsigned long addr, if (!vma || vma->vm_start > addr) return ERR_PTR(-EFAULT); + if (vma_is_dmem(vma)) + return ERR_PTR(-EINVAL); + /* * !old_len is a special case where an attempt is made to 'duplicate' * a mapping. This makes no sense for private mappings as it will From patchwork Mon Dec 7 11:31:17 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11955511 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-15.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 06CC3C4361B for ; Mon, 7 Dec 2020 11:37:56 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id B177D2333F for ; Mon, 7 Dec 2020 11:37:55 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727364AbgLGLh3 (ORCPT ); Mon, 7 Dec 2020 06:37:29 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40936 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727339AbgLGLf0 (ORCPT ); Mon, 7 Dec 2020 06:35:26 -0500 Received: from mail-pg1-x541.google.com (mail-pg1-x541.google.com [IPv6:2607:f8b0:4864:20::541]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 6553AC08ED7E; Mon, 7 Dec 2020 03:35:02 -0800 (PST) Received: by mail-pg1-x541.google.com with SMTP id o5so8647403pgm.10; Mon, 07 Dec 2020 03:35:02 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=d/DMcQ5p9kWqt/8+2V3dACGpqFuk8uMzI8jd59zbvK8=; b=uDgPrdaW7QyZet4jLziwzIsT2FNC+muT99czm8HXIiXGHPutl3zp4sPeraiv07ll3r BQwlcD3KzLyJ9zAdfZ/NPlv4v98ZJhyXKeYwxNluZgjB08IT0zg+7sgClpKtweMJHL2i L6gt6Sx8ux6fk/NHOVacX/B4cu+FfYGthe2oAKzj2STci8jOmsss5yDhreWOWvLf3u2L c14M826QMvoUlMzulslcHONuIx1sPeUjQUjZ8rtHti7e48GSHl4Yi/RFWw5lNMrL8MSM k7CHx/oPR5izM2UgYMUgqFLnQKcs7gnrSlWrGV5pRWN31vfQBOKvYyB71NjPuk9WUbRK SBDw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=d/DMcQ5p9kWqt/8+2V3dACGpqFuk8uMzI8jd59zbvK8=; b=SNvGrhdKrLQ3z6bs+8Ienw+nvfVhgc4H1k9VZtT19J3BdnHritG5wKEB6al5g41LNx PkprroLsQc+9er20D0uR0clOiI92ScOqyV1pFdlsLZnA7rtHTjduwvGHavlZNQQiKZxg n7aRY4LK6WK/B5LxLdeCyeTdvivoOxz1HkpXrYIzaQlWzQBbDTb3B9knnPkww+Vp0WCb ITrR0FggSY368rNHGtnd8PeVyAIRDPn8vAWzGDFYnU9PmB51LVyfbT6A3QiGDGOndaWq +83mCrGmlWNnbq5nttwDXvkr5r4CA/onzl9HpIcHVPgp1rIsj009X9hnulpI+QCpAd8M 2esA== X-Gm-Message-State: AOAM530iD18M95CuLzD3dZR64S/UPhDoR+/tXDlPLQijj0S9Epdqw8cR dnSANbbpEsJMnS44iGXH+9o= X-Google-Smtp-Source: ABdhPJzsTfFyeDeSqCA2nVBB3th4mGCgQXxAf/7SC6+PJE0lBNEtPZT6GhKeWck8kJzv61kWa4w/EA== X-Received: by 2002:aa7:8b15:0:b029:196:59ad:ab93 with SMTP id 
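The VM_DMEM checks added above are just a flag test on vm_flags. A small sketch of the effect (bit values copied from the patch where shown, otherwise assumed; BIT(38) implies a 64-bit vm_flags, i.e. CONFIG_ARCH_USES_HIGH_VMA_FLAGS):

/* Model of vma_is_dmem() and the way mprotect()/mremap() refuse dmem vmas. */
#include <stdint.h>
#include <stdio.h>

#define VM_PFNMAP (1ULL << 10)
#define VM_IO     (1ULL << 14)
#define VM_DMEM   (1ULL << 38)   /* as introduced by this patch */
#define EINVAL    22

struct vma { uint64_t vm_flags; };

static int vma_is_dmem(const struct vma *vma)
{
	return !!(vma->vm_flags & VM_DMEM);
}

/* mprotect_fixup()/vma_to_resize() style check: leave dmem vmas alone. */
static int prot_or_resize_check(const struct vma *vma)
{
	return vma_is_dmem(vma) ? -EINVAL : 0;
}

int main(void)
{
	struct vma dmem = { VM_PFNMAP | VM_IO | VM_DMEM };
	struct vma anon = { 0 };

	printf("%d %d\n", prot_or_resize_check(&dmem), prot_or_resize_check(&anon));
	return 0; /* prints -22 0 */
}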
f21-20020aa78b150000b029019659adab93mr15268290pfd.16.1607340901970; Mon, 07 Dec 2020 03:35:01 -0800 (PST) Received: from localhost.localdomain ([203.205.141.39]) by smtp.gmail.com with ESMTPSA id d4sm14219822pfo.127.2020.12.07.03.34.58 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Mon, 07 Dec 2020 03:35:01 -0800 (PST) From: yulei.kernel@gmail.com X-Google-Original-From: yuleixzhang@tencent.com To: linux-mm@kvack.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com Cc: joao.m.martins@oracle.com, rdunlap@infradead.org, sean.j.christopherson@intel.com, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang , Chen Zhuo Subject: [RFC V2 24/37] dmemfs: support hugepage for dmemfs Date: Mon, 7 Dec 2020 19:31:17 +0800 Message-Id: X-Mailer: git-send-email 2.28.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org From: Yulei Zhang It add hugepage support for dmemfs. We use PFN_DMEM to notify vmf_insert_pfn_pmd, and dmem huge pmd will be marked with _PAGE_SPECIAL and _PAGE_DMEM. So that GUP-fast can separate dmemfs page from other page type and handle it correctly. Signed-off-by: Chen Zhuo Signed-off-by: Yulei Zhang --- fs/dmemfs/inode.c | 113 +++++++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 111 insertions(+), 2 deletions(-) diff --git a/fs/dmemfs/inode.c b/fs/dmemfs/inode.c index b165bd3..17a518c 100644 --- a/fs/dmemfs/inode.c +++ b/fs/dmemfs/inode.c @@ -457,7 +457,7 @@ static int dmemfs_split(struct vm_area_struct *vma, unsigned long addr) return 0; } -static vm_fault_t dmemfs_fault(struct vm_fault *vmf) +static vm_fault_t __dmemfs_fault(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; struct inode *inode = file_inode(vma->vm_file); @@ -485,6 +485,63 @@ static vm_fault_t dmemfs_fault(struct vm_fault *vmf) return ret; } +static vm_fault_t __dmemfs_pmd_fault(struct vm_fault *vmf) +{ + struct vm_area_struct *vma = vmf->vma; + unsigned long pmd_addr = vmf->address & PMD_MASK; + unsigned long page_addr; + struct inode *inode = file_inode(vma->vm_file); + void *entry; + phys_addr_t phys; + pfn_t pfn; + int ret; + + if (dmem_page_size(inode) < PMD_SIZE) + return VM_FAULT_FALLBACK; + + WARN_ON(pmd_addr < vma->vm_start || + vma->vm_end < pmd_addr + PMD_SIZE); + + page_addr = vmf->address & ~(dmem_page_size(inode) - 1); + entry = radix_get_create_entry(vma, page_addr, inode, + linear_page_index(vma, page_addr)); + if (IS_ERR(entry)) + return (PTR_ERR(entry) == -ENOMEM) ? 
+ VM_FAULT_OOM : VM_FAULT_SIGBUS; + + phys = dmem_addr_to_pfn(inode, dmem_entry_to_addr(inode, entry), + linear_page_index(vma, pmd_addr), PMD_SHIFT); + phys <<= PAGE_SHIFT; + pfn = phys_to_pfn_t(phys, PFN_DMEM); + ret = vmf_insert_pfn_pmd(vmf, pfn, !!(vma->vm_flags & VM_WRITE)); + + radix_put_entry(); + return ret; +} + +static vm_fault_t dmemfs_huge_fault(struct vm_fault *vmf, enum page_entry_size pe_size) +{ + int ret; + + switch (pe_size) { + case PE_SIZE_PTE: + ret = __dmemfs_fault(vmf); + break; + case PE_SIZE_PMD: + ret = __dmemfs_pmd_fault(vmf); + break; + default: + ret = VM_FAULT_SIGBUS; + } + + return ret; +} + +static vm_fault_t dmemfs_fault(struct vm_fault *vmf) +{ + return dmemfs_huge_fault(vmf, PE_SIZE_PTE); +} + static unsigned long dmemfs_pagesize(struct vm_area_struct *vma) { return dmem_page_size(file_inode(vma->vm_file)); @@ -495,6 +552,7 @@ static unsigned long dmemfs_pagesize(struct vm_area_struct *vma) .fault = dmemfs_fault, .pagesize = dmemfs_pagesize, .access = dmemfs_access_dmem, + .huge_fault = dmemfs_huge_fault, }; int dmemfs_file_mmap(struct file *file, struct vm_area_struct *vma) @@ -507,15 +565,66 @@ int dmemfs_file_mmap(struct file *file, struct vm_area_struct *vma) if (!(vma->vm_flags & VM_SHARED)) return -EINVAL; - vma->vm_flags |= VM_PFNMAP | VM_DMEM | VM_IO; + vma->vm_flags |= VM_PFNMAP | VM_DONTCOPY | VM_DMEM | VM_IO; + + if (dmem_page_size(inode) != PAGE_SIZE) + vma->vm_flags |= VM_HUGEPAGE; file_accessed(file); vma->vm_ops = &dmemfs_vm_ops; return 0; } +/* + * If the size of area returned by mm->get_unmapped_area() is one + * dmem pagesize larger than 'len', the returned addr by + * mm->get_unmapped_area() could be aligned to dmem pagesize to + * meet alignment demand. + */ +static unsigned long +dmemfs_get_unmapped_area(struct file *file, unsigned long addr, + unsigned long len, unsigned long pgoff, + unsigned long flags) +{ + unsigned long len_pad; + unsigned long off = pgoff << PAGE_SHIFT; + unsigned long align; + + align = dmem_page_size(file_inode(file)); + + /* For pud or pmd pagesize, could not support fault fallback. */ + if (len & (align - 1)) + return -EINVAL; + if (len > TASK_SIZE) + return -ENOMEM; + + if (flags & MAP_FIXED) { + if (addr & (align - 1)) + return -EINVAL; + return addr; + } + + /* + * Pad a extra align space for 'len', as we want to find a unmapped + * area which is larger enough to align with dmemfs pagesize, if + * pagesize of dmem is larger than 4K. + */ + len_pad = (align == PAGE_SIZE) ? len : len + align; + + /* 'len' or 'off' is too large for pad. */ + if (len_pad < len || (off + len_pad) < off) + return -EINVAL; + + addr = current->mm->get_unmapped_area(file, addr, len_pad, + pgoff, flags); + + /* Now 'addr' could be aligned to upper boundary. */ + return IS_ERR_VALUE(addr) ? 
addr : round_up(addr, align); +} + static const struct file_operations dmemfs_file_operations = { .mmap = dmemfs_file_mmap, + .get_unmapped_area = dmemfs_get_unmapped_area, }; static int dmemfs_parse_param(struct fs_context *fc, struct fs_parameter *param) From patchwork Mon Dec 7 11:31:18 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11955503 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-15.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id B280FC1B0E3 for ; Mon, 7 Dec 2020 11:37:29 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 7AC67233A0 for ; Mon, 7 Dec 2020 11:37:29 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727745AbgLGLhP (ORCPT ); Mon, 7 Dec 2020 06:37:15 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40924 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727403AbgLGLf3 (ORCPT ); Mon, 7 Dec 2020 06:35:29 -0500 Received: from mail-pf1-x441.google.com (mail-pf1-x441.google.com [IPv6:2607:f8b0:4864:20::441]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id B2C6BC094241; Mon, 7 Dec 2020 03:35:05 -0800 (PST) Received: by mail-pf1-x441.google.com with SMTP id i3so6030219pfd.6; Mon, 07 Dec 2020 03:35:05 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=DlQnH7arXkpV3AxvylV8KPwG+QOIFG2k6zceWF8EMek=; b=slULaRgClfSRtlq6MZQolWl14N3WoKzCDl91aATl6tX8HOfVXd7kfIwWn5h47oBm3h xiPGbI6vPy35U0R/c4QR8TcprKIJDFu9YT12UcQuplY38vOUzuIMOUmOng8ftygUWfTr R1ZAIx1wp1N0xM3YOBmolFbpCccHmRq/UeETV4HjisGBwUXZHQZrrafX6WsVEOPwXQu9 0N2lNLtp3gJavRT6Vww4WvELq4NKpp8UbezAjxjLamlZR2rUohbuyED91YZZx9c5W52C 8RHuWAx0H4siG5FrBXMPIyV8dIchxJEE581mAM4LCKH+oYKBsM2HDcxznhasrQeZ1BcL O2Tw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=DlQnH7arXkpV3AxvylV8KPwG+QOIFG2k6zceWF8EMek=; b=PXIYBKFFvVghRkPHNXzdVcE9ivR57Ta2J1FYEX673hG7DjbHXODYzKrodgoir2Pyty 4irQtt/ec8GYFVRAfA08c8415BrTei3Qd5dNqoRJfB77SvG8MnYY7x+UVnPBNGxMsiQI AFUvPgGRG9BHHt/bcrJ14iqBdajVZFKcUgj+fXrI+1EkTbyZR1BlWCTdjm111OUirCj5 5t0R66dN0baYMVypnKXL2phGfXckO/drnHcUNTHiSVG/Xq6YHvyUKG+LqgD4FmNlLr0w X/MkoA3Op2ncZ3ZQMyIgX8HEkUzHz/7tN4E7AZhtlhVyhgVNLvigKDg7wr1GX4WNGpdk 3CaQ== X-Gm-Message-State: AOAM532WOuQg82k2N7Gl28h3Kija0jMiGlVBNkzO/78RTefUxsqaZ8yi QFjXRPdSpCy8iAcu9MSr5tTkxzb0DG8= X-Google-Smtp-Source: ABdhPJzalODO+5vSfJ8urE4UuOEPrIa+ijmCZKRr1s/BpX28tmoQbHCWWrOgmsnDQaiZ1nRUpwNqDg== X-Received: by 2002:a17:902:bc4b:b029:db:2d61:5f37 with SMTP id t11-20020a170902bc4bb02900db2d615f37mr99717plz.79.1607340905372; Mon, 07 Dec 2020 03:35:05 -0800 (PST) Received: from localhost.localdomain ([203.205.141.39]) by smtp.gmail.com with ESMTPSA id 
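The padding trick in dmemfs_get_unmapped_area() above is worth spelling out: by asking the mm for len + align bytes, any address it returns can be rounded up to the dmem page size while still leaving room for len. A stand-alone sketch with an assumed 2M dmem page size:

/* Model of the len_pad / round_up() logic in dmemfs_get_unmapped_area(). */
#include <stdio.h>

#define PAGE_SIZE 4096UL

static unsigned long round_up_to(unsigned long x, unsigned long align)
{
	return (x + align - 1) & ~(align - 1);   /* align must be a power of two */
}

int main(void)
{
	unsigned long align   = 2UL << 20;         /* 2M dmem page size (assumed) */
	unsigned long len     = 4 * (2UL << 20);   /* must be a multiple of align */
	unsigned long len_pad = (align == PAGE_SIZE) ? len : len + align;

	/* Pretend mm->get_unmapped_area() returned an arbitrary 4K-aligned addr. */
	unsigned long addr    = 0x7f1234561000UL;
	unsigned long aligned = round_up_to(addr, align);

	printf("raw=%#lx aligned=%#lx fits=%d\n", addr, aligned,
	       aligned + len <= addr + len_pad);   /* fits=1 */
	return 0;
}

Without the extra align bytes, rounding the start up could push the end of the mapping past the area the mm actually found.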
d4sm14219822pfo.127.2020.12.07.03.35.02 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Mon, 07 Dec 2020 03:35:04 -0800 (PST) From: yulei.kernel@gmail.com X-Google-Original-From: yuleixzhang@tencent.com To: linux-mm@kvack.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com Cc: joao.m.martins@oracle.com, rdunlap@infradead.org, sean.j.christopherson@intel.com, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang , Chen Zhuo Subject: [RFC V2 25/37] mm, x86, dmem: fix estimation of reserved page for vaddr_get_pfn() Date: Mon, 7 Dec 2020 19:31:18 +0800 Message-Id: <3bd3e2d485c46fae9eaeb501e1a6c51d19570b49.1607332046.git.yuleixzhang@tencent.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org From: Yulei Zhang Fix estimation of reserved page for vaddr_get_pfn() and check 'ret' before checking writable permission Signed-off-by: Chen Zhuo Signed-off-by: Yulei Zhang --- drivers/vfio/vfio_iommu_type1.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index 67e8276..c465d1a 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -471,6 +471,10 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr, if (ret == -EAGAIN) goto retry; + if (!ret && (prot & IOMMU_WRITE) && + !(vma->vm_flags & VM_WRITE)) + ret = -EFAULT; + if (!ret && !is_invalid_reserved_pfn(*pfn)) ret = -EFAULT; } From patchwork Mon Dec 7 11:31:19 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11955507 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-15.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 1DE85C2BB48 for ; Mon, 7 Dec 2020 11:37:30 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id D1F812333F for ; Mon, 7 Dec 2020 11:37:29 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727730AbgLGLhO (ORCPT ); Mon, 7 Dec 2020 06:37:14 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40948 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727412AbgLGLfa (ORCPT ); Mon, 7 Dec 2020 06:35:30 -0500 Received: from mail-pf1-x432.google.com (mail-pf1-x432.google.com [IPv6:2607:f8b0:4864:20::432]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 17ED3C094242; Mon, 7 Dec 2020 03:35:09 -0800 (PST) Received: by mail-pf1-x432.google.com with SMTP id 131so9601916pfb.9; Mon, 07 Dec 2020 03:35:09 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; 
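The vfio change above is purely about check ordering: only consult the vma permissions once 'ret' says the lookup succeeded, and fail a writable IOMMU mapping of a read-only vma. A condensed model (flag and errno values assumed for illustration):

/* Order of checks mirrored from the patched vaddr_get_pfn() tail. */
#include <stdio.h>

#define IOMMU_WRITE 0x2
#define VM_WRITE    0x2
#define EFAULT      14

static int check_pfn(int ret, int prot, unsigned long vm_flags, int pfn_reserved)
{
	if (!ret && (prot & IOMMU_WRITE) && !(vm_flags & VM_WRITE))
		ret = -EFAULT;   /* writable DMA into a read-only vma */
	if (!ret && !pfn_reserved)
		ret = -EFAULT;   /* pfn is not an invalid/reserved pfn */
	return ret;
}

int main(void)
{
	printf("%d\n", check_pfn(0, IOMMU_WRITE, 0, 1));        /* -14 */
	printf("%d\n", check_pfn(0, IOMMU_WRITE, VM_WRITE, 1)); /* 0   */
	return 0;
}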
bh=wOxoYAByFgjYHcAfYN08FzgKY6QNyvq8eeqHRPrva4Q=; b=IJifq/wHL5++Uh9KafwXFdFhAIVbfzAETWaT0Wz412DL1zioa8ks0IHqQMobHE0lb7 il49a16YoybvNwcEnH1LbheLRq9esiNl+KUZTTKeEYT3GykktEU9CyjsiyRRlcrL8V9i xjdjnm7I93RF446dlPxAOTolle3sJKf7sCBQPeMYaghlxfkhBNp+bJGbw3s+o46xRf18 SFvcQ0hHzZ5p8+RSv8+DKLLtLxoiI57o2MK8qgH6IXd7lmwhhobC8+LpPY6VhvKxG+CC hFuzLfRe+F25M9UbsIsS/AdyWNtgCGjD+2gfmnfUtyTDHFpMLDYdfzABVdQIvMR60xX3 6Y1Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=wOxoYAByFgjYHcAfYN08FzgKY6QNyvq8eeqHRPrva4Q=; b=WpIjKRUrjstbjaJjmBIzi0qiMmbYP24njIDLy62SinlTwq9gNLNCOyGr+N9vAdvfG3 1aHn+xI+YWAQkoHqy2mLXT234rT1nXTteRKFrIszZEv0F9sSIECIvAIZv25sS5ZcOXGh IsTIrPf1LtygUxfwxNG0p3/BbLIguu2Zb4tv9oUuYB12EZaV6rNiqNwvshkU16zMa8XJ k4L5jPNoaNUGjfIq89dTeTpQl7jC38g7+1KielYkN1aDo81MDV8cK259AeMfaW11PVkh NiSlxnCk3x9/OcRQHDJdOLSJ2ihcYYBHhQbScf2ei0V4RWyavSmAwnF3Q/ion8taMJQd +diQ== X-Gm-Message-State: AOAM533hJB0hsk24Vc130O/6FJxfByAq0f7btPBcY9mXdCOQXmzhOwPv b8RIhxtgyabgcYzmu231I34= X-Google-Smtp-Source: ABdhPJwhpFxBYkxRsLS+BV+lvSO+B7245m5N1dEnwhx2kuzLaOR+A4L9InETPPJipSBryp7sZwn4Zw== X-Received: by 2002:a62:80ce:0:b029:19d:b280:5019 with SMTP id j197-20020a6280ce0000b029019db2805019mr15544917pfd.43.1607340908718; Mon, 07 Dec 2020 03:35:08 -0800 (PST) Received: from localhost.localdomain ([203.205.141.39]) by smtp.gmail.com with ESMTPSA id d4sm14219822pfo.127.2020.12.07.03.35.05 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Mon, 07 Dec 2020 03:35:08 -0800 (PST) From: yulei.kernel@gmail.com X-Google-Original-From: yuleixzhang@tencent.com To: linux-mm@kvack.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com Cc: joao.m.martins@oracle.com, rdunlap@infradead.org, sean.j.christopherson@intel.com, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang , Chen Zhuo Subject: [RFC V2 26/37] mm, dmem: introduce pud_special() for dmem huge pud support Date: Mon, 7 Dec 2020 19:31:19 +0800 Message-Id: <24c19b7db2fa3b405358489fc74a02cf648bfaf1.1607332046.git.yuleixzhang@tencent.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org From: Yulei Zhang pud_special() will check both _PAGE_SPECIAL and _PAGE_DMEM bit as pmd_special() does. 
Signed-off-by: Chen Zhuo Signed-off-by: Yulei Zhang --- arch/x86/include/asm/pgtable.h | 13 +++++++++++++ include/linux/pgtable.h | 10 ++++++++++ 2 files changed, 23 insertions(+) diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 6ce85d4..9e36d42 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -281,6 +281,12 @@ static inline int pmd_special(pmd_t pmd) return (pmd_val(pmd) & (_PAGE_SPECIAL | _PAGE_DMEM)) == (_PAGE_SPECIAL | _PAGE_DMEM); } + +static inline int pud_special(pud_t pud) +{ + return (pud_val(pud) & (_PAGE_SPECIAL | _PAGE_DMEM)) == + (_PAGE_SPECIAL | _PAGE_DMEM); +} #endif #ifdef CONFIG_ARCH_HAS_PTE_DEVMAP @@ -516,6 +522,13 @@ static inline pud_t pud_mkdirty(pud_t pud) return pud_set_flags(pud, _PAGE_DIRTY | _PAGE_SOFT_DIRTY); } +#ifdef CONFIG_ARCH_HAS_PTE_DMEM +static inline pud_t pud_mkdmem(pud_t pud) +{ + return pud_set_flags(pud, _PAGE_SPECIAL | _PAGE_DMEM); +} +#endif + static inline pud_t pud_mkdevmap(pud_t pud) { return pud_set_flags(pud, _PAGE_DEVMAP); diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 30342b8..0ef03ff 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -1167,6 +1167,16 @@ static inline int pmd_special(pmd_t pmd) { return 0; } + +static inline pud_t pud_mkdmem(pud_t pud) +{ + return pud; +} + +static inline int pud_special(pud_t pud) +{ + return 0; +} #endif #ifndef pmd_read_atomic From patchwork Mon Dec 7 11:31:20 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11955497 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-15.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9C7B4C2BBCD for ; Mon, 7 Dec 2020 11:37:02 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 62D6C2333F for ; Mon, 7 Dec 2020 11:37:02 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727455AbgLGLgv (ORCPT ); Mon, 7 Dec 2020 06:36:51 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41002 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727470AbgLGLfh (ORCPT ); Mon, 7 Dec 2020 06:35:37 -0500 Received: from mail-pj1-x1034.google.com (mail-pj1-x1034.google.com [IPv6:2607:f8b0:4864:20::1034]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 74C6BC061A52; Mon, 7 Dec 2020 03:35:12 -0800 (PST) Received: by mail-pj1-x1034.google.com with SMTP id p21so6863193pjv.0; Mon, 07 Dec 2020 03:35:12 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=R1pOMiSZIT4zzWwCC3RI8cqklW1Nb035cekwxKtaaVc=; b=tdedZG4TA6YIeSkLXKLwKJSIDRJsgaWD49qSThY/5qRbEOGMhswjKF+y+3tE//Q9GX Z46nH7F5RztgV7IpQF+nvQh5fP0PRvVuDbCtmL+DqLUnzwsCtwEWrcKYUUE4IOslWky6 zQLJvUhDs403MDnqw27lWTBP71HJqyW9dk+grHTF+6OPgJ452jBXTUZi8b4Iy1LCRY0a 
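The PUD-level encoding mirrors the PMD one, with the extra twist (used by the next patch) that pud_trans_huge() must not report a dmem PUD as transparent-huge. A user-space sketch of both checks; bit positions other than _PAGE_PSE/_PAGE_SPECIAL are assumed:

/* Model of pud_mkdmem()/pud_special() and the patched pud_trans_huge(). */
#include <stdint.h>
#include <stdio.h>

#define _PAGE_PSE     (1ULL << 7)
#define _PAGE_SPECIAL (1ULL << 9)
#define _PAGE_DEVMAP  (1ULL << 47)  /* assumed */
#define _PAGE_DMEM    (1ULL << 58)  /* assumed */

typedef uint64_t pud_t;

static pud_t pud_mkdmem(pud_t pud)
{
	return pud | _PAGE_SPECIAL | _PAGE_DMEM;
}

static int pud_special(pud_t pud)
{
	return (pud & (_PAGE_SPECIAL | _PAGE_DMEM)) ==
	       (_PAGE_SPECIAL | _PAGE_DMEM);
}

static int pud_trans_huge(pud_t pud)
{
	/* PSE set, but neither devmap nor dmem: matches the patched check. */
	return (pud & (_PAGE_PSE | _PAGE_DEVMAP | _PAGE_DMEM)) == _PAGE_PSE;
}

int main(void)
{
	pud_t dmem = pud_mkdmem(_PAGE_PSE);

	printf("special=%d trans_huge=%d\n", pud_special(dmem), pud_trans_huge(dmem));
	return 0; /* prints special=1 trans_huge=0 */
}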
B7KsP2/DYqTQU9tErE4rG6jV75Gvt9+RD+EOGMwwsh3JBsAoxFJxie7XD44CJODdyBMH ZZjkx5xM1o3RMpaknmRKIUEsPRJuMTFE4JxUIeBNSdaI67DN+EkZESbAsjtXaH7s5v8J dDgQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=R1pOMiSZIT4zzWwCC3RI8cqklW1Nb035cekwxKtaaVc=; b=XarqcgMty2DANalvWUi+wBj7eVwtfSz4uJHvuumkVUU4IfchAJVylU5645ao0FKS4d kJ6pHp/m2B8XdveZZ1rrNKCfPY3FlC2Zdu7YsLf/Kb240+QquKbT6DF1OsMZVarGHDcO F9BsMUgEbnvWmJny7sGaAP/Eb13Z8UJ5K0+1SSgLSbXNt0CDnYpBjy1HZCcIWzdNNUxL +kJJ9g6qADk7O1WcmRow6ZxqoIDGIRznInnRU0hVE9TT0s8l0VP+pW+FpkGbZD+N71T+ 7qcC8+zi6bV4FYe7wkIh7gkdoTixotG+cfxUHbK7CHZUxmZm6jDHPwUZuXXF/cDJKcN5 k9Lg== X-Gm-Message-State: AOAM533v/0xfl/sY5vLcuhxAcY0VedWLy8tMbl7MG4xCmHnV6+7xz5e1 R2UyGezRlZhUa/Z5SyEptJk= X-Google-Smtp-Source: ABdhPJwOADOp4BHe+nOTWmOnCV7Krmz1UonOWASME9633gE5AHOao5YhBEMB9G6gr5genfai3+Mg9g== X-Received: by 2002:a17:90a:9e5:: with SMTP id 92mr16288519pjo.176.1607340911988; Mon, 07 Dec 2020 03:35:11 -0800 (PST) Received: from localhost.localdomain ([203.205.141.39]) by smtp.gmail.com with ESMTPSA id d4sm14219822pfo.127.2020.12.07.03.35.08 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Mon, 07 Dec 2020 03:35:11 -0800 (PST) From: yulei.kernel@gmail.com X-Google-Original-From: yuleixzhang@tencent.com To: linux-mm@kvack.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com Cc: joao.m.martins@oracle.com, rdunlap@infradead.org, sean.j.christopherson@intel.com, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang , Chen Zhuo Subject: [RFC V2 27/37] mm: add pud_special() check to support dmem huge pud Date: Mon, 7 Dec 2020 19:31:20 +0800 Message-Id: X-Mailer: git-send-email 2.28.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org From: Yulei Zhang Add pud_special() and follow_special_pud() to support dmem huge pud as we do for dmem huge pmd. 
Signed-off-by: Chen Zhuo Signed-off-by: Yulei Zhang --- arch/x86/include/asm/pgtable.h | 2 +- include/linux/huge_mm.h | 2 +- mm/gup.c | 46 ++++++++++++++++++++++++++++++++++++++++++ mm/huge_memory.c | 11 ++++++---- mm/memory.c | 4 ++-- mm/mprotect.c | 2 ++ mm/pagewalk.c | 2 +- 7 files changed, 60 insertions(+), 9 deletions(-) diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 9e36d42..2284387 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -265,7 +265,7 @@ static inline int pmd_trans_huge(pmd_t pmd) #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD static inline int pud_trans_huge(pud_t pud) { - return (pud_val(pud) & (_PAGE_PSE|_PAGE_DEVMAP)) == _PAGE_PSE; + return (pud_val(pud) & (_PAGE_PSE|_PAGE_DEVMAP|_PAGE_DMEM)) == _PAGE_PSE; } #endif diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 2514b90..b69c940 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -251,7 +251,7 @@ static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd, static inline spinlock_t *pud_trans_huge_lock(pud_t *pud, struct vm_area_struct *vma) { - if (pud_trans_huge(*pud) || pud_devmap(*pud)) + if (pud_trans_huge(*pud) || pud_devmap(*pud) || pud_special(*pud)) return __pud_trans_huge_lock(pud, vma); else return NULL; diff --git a/mm/gup.c b/mm/gup.c index 0ea9071..8eb85ba 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -423,6 +423,42 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address, return ERR_PTR(-EEXIST); } +static struct page * +follow_special_pud(struct vm_area_struct *vma, unsigned long address, + pud_t *pud, unsigned int flags) +{ + spinlock_t *ptl; + + if ((flags & FOLL_DUMP) && is_huge_zero_pud(*pud)) + /* Avoid special (like zero) pages in core dumps */ + return ERR_PTR(-EFAULT); + + /* No page to get reference */ + if (flags & FOLL_GET) + return ERR_PTR(-EFAULT); + + if (flags & FOLL_TOUCH) { + pud_t _pud; + + ptl = pud_lock(vma->vm_mm, pud); + if (!pud_special(*pud)) { + spin_unlock(ptl); + return NULL; + } + _pud = pud_mkyoung(*pud); + if (flags & FOLL_WRITE) + _pud = pud_mkdirty(_pud); + if (pudp_set_access_flags(vma, address & HPAGE_PMD_MASK, + pud, _pud, + flags & FOLL_WRITE)) + update_mmu_cache_pud(vma, address, pud); + spin_unlock(ptl); + } + + /* Proper page table entry exists, but no corresponding struct page */ + return ERR_PTR(-EEXIST); +} + /* * FOLL_FORCE can write to even unwritable pte's, but only * after we've gone through a COW cycle and they are dirty. @@ -726,6 +762,12 @@ static struct page *follow_pud_mask(struct vm_area_struct *vma, return page; return no_page_table(vma, flags); } + if (pud_special(*pud)) { + page = follow_special_pud(vma, address, pud, flags); + if (page) + return page; + return no_page_table(vma, flags); + } if (is_hugepd(__hugepd(pud_val(*pud)))) { page = follow_huge_pd(vma, address, __hugepd(pud_val(*pud)), flags, @@ -2511,6 +2553,10 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr, if (!pud_access_permitted(orig, flags & FOLL_WRITE)) return 0; + /* Bypass dmem pud. It will be handled in outside routine. 
*/ + if (pud_special(orig)) + return 0; + if (pud_devmap(orig)) { if (unlikely(flags & FOLL_LONGTERM)) return 0; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 6e52d57..7c5385a 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -883,6 +883,8 @@ static void insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr, entry = pud_mkhuge(pfn_t_pud(pfn, prot)); if (pfn_t_devmap(pfn)) entry = pud_mkdevmap(entry); + if (pfn_t_dmem(pfn)) + entry = pud_mkdmem(entry); if (write) { entry = pud_mkyoung(pud_mkdirty(entry)); entry = maybe_pud_mkwrite(entry, vma); @@ -919,7 +921,7 @@ vm_fault_t vmf_insert_pfn_pud_prot(struct vm_fault *vmf, pfn_t pfn, * can't support a 'special' bit. */ BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) && - !pfn_t_devmap(pfn)); + !pfn_t_devmap(pfn) && !pfn_t_dmem(pfn)); BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) == (VM_PFNMAP|VM_MIXEDMAP)); BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags)); @@ -1911,7 +1913,7 @@ spinlock_t *__pud_trans_huge_lock(pud_t *pud, struct vm_area_struct *vma) spinlock_t *ptl; ptl = pud_lock(vma->vm_mm, pud); - if (likely(pud_trans_huge(*pud) || pud_devmap(*pud))) + if (likely(pud_trans_huge(*pud) || pud_devmap(*pud) || pud_special(*pud))) return ptl; spin_unlock(ptl); return NULL; @@ -1922,6 +1924,7 @@ int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma, pud_t *pud, unsigned long addr) { spinlock_t *ptl; + pud_t orig_pud; ptl = __pud_trans_huge_lock(pud, vma); if (!ptl) @@ -1932,9 +1935,9 @@ int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma, * pgtable_trans_huge_withdraw after finishing pudp related * operations. */ - pudp_huge_get_and_clear_full(tlb->mm, addr, pud, tlb->fullmm); + orig_pud = pudp_huge_get_and_clear_full(tlb->mm, addr, pud, tlb->fullmm); tlb_remove_pud_tlb_entry(tlb, pud, addr); - if (vma_is_special_huge(vma)) { + if (vma_is_special_huge(vma) || pud_special(orig_pud)) { spin_unlock(ptl); /* No zero page support yet */ } else { diff --git a/mm/memory.c b/mm/memory.c index abb9148..01f3b05 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1078,7 +1078,7 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr, src_pud = pud_offset(src_p4d, addr); do { next = pud_addr_end(addr, end); - if (pud_trans_huge(*src_pud) || pud_devmap(*src_pud)) { + if (pud_trans_huge(*src_pud) || pud_devmap(*src_pud) || pud_special(*src_pud)) { int err; VM_BUG_ON_VMA(next-addr != HPAGE_PUD_SIZE, src_vma); @@ -1375,7 +1375,7 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb, pud = pud_offset(p4d, addr); do { next = pud_addr_end(addr, end); - if (pud_trans_huge(*pud) || pud_devmap(*pud)) { + if (pud_trans_huge(*pud) || pud_devmap(*pud) || pud_special(*pud)) { if (next - addr != HPAGE_PUD_SIZE) { mmap_assert_locked(tlb->mm); split_huge_pud(vma, pud, addr); diff --git a/mm/mprotect.c b/mm/mprotect.c index b1650b5..05fa453 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -292,6 +292,8 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma, pud = pud_offset(p4d, addr); do { next = pud_addr_end(addr, end); + if (pud_special(*pud)) + continue; if (pud_none_or_clear_bad(pud)) continue; pages += change_pmd_range(vma, pud, addr, next, newprot, diff --git a/mm/pagewalk.c b/mm/pagewalk.c index e7c4575..afd8bca 100644 --- a/mm/pagewalk.c +++ b/mm/pagewalk.c @@ -129,7 +129,7 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end, do { again: next = pud_addr_end(addr, end); - if (pud_none(*pud) || (!walk->vma && 
!walk->no_vma)) { + if (pud_none(*pud) || (!walk->vma && !walk->no_vma) || pud_special(*pud)) { if (ops->pte_hole) err = ops->pte_hole(addr, next, depth, walk); if (err) From patchwork Mon Dec 7 11:31:21 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11955501 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-12.9 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,UNWANTED_LANGUAGE_BODY, URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 1453EC4167B for ; Mon, 7 Dec 2020 11:37:29 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id B7D0A2333F for ; Mon, 7 Dec 2020 11:37:28 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727381AbgLGLhI (ORCPT ); Mon, 7 Dec 2020 06:37:08 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40898 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727425AbgLGLfc (ORCPT ); Mon, 7 Dec 2020 06:35:32 -0500 Received: from mail-pf1-x431.google.com (mail-pf1-x431.google.com [IPv6:2607:f8b0:4864:20::431]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 9C89CC0613D2; Mon, 7 Dec 2020 03:35:15 -0800 (PST) Received: by mail-pf1-x431.google.com with SMTP id c79so9618736pfc.2; Mon, 07 Dec 2020 03:35:15 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=WOqN3wAFWNmwPLLbitVjeNVP3/1EeNDqwEgp1U7by0Y=; b=t2mhCRx4ddW/Vy1XTuXt7xLNO/CzgHXZeydB8wf2nGDpcuoOfUMGpcemrPer6RwsAu uNrNzd2nFAftQMHWukm68XBY2ZReEiQDmMuknuogHStEJ92RpzjB90S9K9a1cbyDhi2W 9Vm6qF/G07WKj7L5wgUVDnQOr5/Zh79lLesUmR9Lh/aRQyBdGzd14FIUZji0fOJd8pl+ /jy3wxMeIMmrTvfNtbWdW+RdAKQyWUr9HmQ31ldStU+rb3XP8uT16WXN0+pGuRlr4ZTK AOGRKulren3uPBcL5hwFYJaFeUE2vqsl9ziKH6yRfEJ//oEfAk2A5bJm4l2xsQwEEOOy Hptg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=WOqN3wAFWNmwPLLbitVjeNVP3/1EeNDqwEgp1U7by0Y=; b=EJLd6ULjOYjaQkY3OaixEUj1mOdd9JsPSLmGlkfqLqNyMLwUPr9nUw7XWbEqOWjApV gsId4qNCfbb8plitRD0isCdXEpR9H/+jz6Msx7uFeHrKYO5CAqDYiHPnhjSlve5JY8tg 7+nMVVch5sVpdUWN3a98v/YV6xUQL57UIxakp2ZpU3c+fqFDd2lfrkXSS63ryHeuksq8 cW5NBcH3giMPasNzBaa25wSANyHbIQH5+9ImSKl3QQx+ryy91Km6PvbIR+Z86zilCI5B 7S+sIpw29Bw3Ypvw9zv5JyS4ejJn1xkhoLlsbHw+WvCAGQ8Rl5Yec+tfat1IxMy75iFL jlHQ== X-Gm-Message-State: AOAM533agR3739qsaOo0XZaZZTUqLletb8ko/XHYi95HNioLEAKRIRVE MLNhQsiWbol5FM/7gdC2xtU= X-Google-Smtp-Source: ABdhPJwbvfVevLMEXxhQw8TM4SqqQZ/7ZuW0O/nKHs8QddAXnvoNRyEBcv9DCTj5vDpAlxEcd07BrA== X-Received: by 2002:a17:902:860a:b029:da:e83a:7f1f with SMTP id f10-20020a170902860ab02900dae83a7f1fmr7784842plo.60.1607340915254; Mon, 07 Dec 2020 03:35:15 -0800 (PST) Received: from localhost.localdomain ([203.205.141.39]) by smtp.gmail.com with ESMTPSA id d4sm14219822pfo.127.2020.12.07.03.35.12 (version=TLS1_2 
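follow_special_pud() above compresses to a small decision: the translation exists, but there is no struct page behind it, so callers that want a page reference must fail while others just learn the entry exists. A condensed sketch (FOLL_* and errno values assumed; the FOLL_TOUCH access/dirty update in the real function is omitted):

/* Condensed result logic of follow_special_pud(). */
#include <stdio.h>

#define FOLL_GET  0x04
#define FOLL_DUMP 0x08
#define EFAULT    14
#define EEXIST    17

static int special_pud_result(unsigned int flags, int is_zero_pud)
{
	if ((flags & FOLL_DUMP) && is_zero_pud)
		return -EFAULT;   /* keep special/zero pages out of core dumps */
	if (flags & FOLL_GET)
		return -EFAULT;   /* no struct page to pin */
	return -EEXIST;           /* valid entry, but no page to hand back */
}

int main(void)
{
	printf("%d %d\n", special_pud_result(FOLL_GET, 0),
	       special_pud_result(0, 0)); /* -14 -17 */
	return 0;
}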
cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Mon, 07 Dec 2020 03:35:14 -0800 (PST) From: yulei.kernel@gmail.com X-Google-Original-From: yuleixzhang@tencent.com To: linux-mm@kvack.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com Cc: joao.m.martins@oracle.com, rdunlap@infradead.org, sean.j.christopherson@intel.com, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang , Chen Zhuo Subject: [RFC V2 28/37] mm, dmemfs: support huge_fault() for dmemfs Date: Mon, 7 Dec 2020 19:31:21 +0800 Message-Id: X-Mailer: git-send-email 2.28.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org From: Yulei Zhang Introduce __dmemfs_huge_fault() to handle 1G huge pud for dmemfs. Signed-off-by: Chen Zhuo Signed-off-by: Yulei Zhang --- fs/dmemfs/inode.c | 40 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 40 insertions(+) diff --git a/fs/dmemfs/inode.c b/fs/dmemfs/inode.c index 17a518c..f698b9d 100644 --- a/fs/dmemfs/inode.c +++ b/fs/dmemfs/inode.c @@ -519,6 +519,43 @@ static vm_fault_t __dmemfs_pmd_fault(struct vm_fault *vmf) return ret; } +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD +static vm_fault_t __dmemfs_huge_fault(struct vm_fault *vmf) +{ + struct vm_area_struct *vma = vmf->vma; + unsigned long pud_addr = vmf->address & PUD_MASK; + struct inode *inode = file_inode(vma->vm_file); + void *entry; + phys_addr_t phys; + pfn_t pfn; + int ret; + + if (dmem_page_size(inode) < PUD_SIZE) + return VM_FAULT_FALLBACK; + + WARN_ON(pud_addr < vma->vm_start || + vma->vm_end < pud_addr + PUD_SIZE); + + entry = radix_get_create_entry(vma, pud_addr, inode, + linear_page_index(vma, pud_addr)); + if (IS_ERR(entry)) + return (PTR_ERR(entry) == -ENOMEM) ? 
+ VM_FAULT_OOM : VM_FAULT_SIGBUS; + + phys = dmem_entry_to_addr(inode, entry); + pfn = phys_to_pfn_t(phys, PFN_DMEM); + ret = vmf_insert_pfn_pud(vmf, pfn, !!(vma->vm_flags & VM_WRITE)); + + radix_put_entry(); + return ret; +} +#else +static vm_fault_t __dmemfs_huge_fault(struct vm_fault *vmf) +{ + return VM_FAULT_FALLBACK; +} +#endif /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ + static vm_fault_t dmemfs_huge_fault(struct vm_fault *vmf, enum page_entry_size pe_size) { int ret; @@ -530,6 +567,9 @@ static vm_fault_t dmemfs_huge_fault(struct vm_fault *vmf, enum page_entry_size p case PE_SIZE_PMD: ret = __dmemfs_pmd_fault(vmf); break; + case PE_SIZE_PUD: + ret = __dmemfs_huge_fault(vmf); + break; default: ret = VM_FAULT_SIGBUS; } From patchwork Mon Dec 7 11:31:22 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11955499 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-15.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id A93DBC433FE for ; Mon, 7 Dec 2020 11:37:28 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 4295E2333F for ; Mon, 7 Dec 2020 11:37:28 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727247AbgLGLg6 (ORCPT ); Mon, 7 Dec 2020 06:36:58 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41008 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727436AbgLGLfe (ORCPT ); Mon, 7 Dec 2020 06:35:34 -0500 Received: from mail-pg1-x536.google.com (mail-pg1-x536.google.com [IPv6:2607:f8b0:4864:20::536]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id EF114C0613D1; Mon, 7 Dec 2020 03:35:18 -0800 (PST) Received: by mail-pg1-x536.google.com with SMTP id o5so8647982pgm.10; Mon, 07 Dec 2020 03:35:18 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=TpyAX8bjmgKWxsxprFNaku2nEJ5v4XWrim2WqmZHDpg=; b=eS4D42NrIJAv240gGh41rvA64V7u41KBS4tRtOuPLsuYqHzqarSS8/n7/tX9diZSle XcvHurUEWFhPNZhnW01+0VblwSa4+nqPfvuelH67hPrhQTsGl0xYiozhsfaNQdt9UcuU xTIW+q4n5aGxXZ9kufVbFlTvHWOCocxT0VIsxD+mL1ugo2Z2vjK6AUIMi/F+wUKrXguJ KblNy3cm9NJAk6ucQE5OjL/xyx5c3YMwL7Ulxvlteu0bTmPrUEiEIsIfNOYUO1n/oHDR 4l0DL0eiG4Uvtj1Wdkd9w/kb6JwDma9v7EE8NNl/eVDjMBv/foOGibW+Nd/9kNb3U8bT M6PQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=TpyAX8bjmgKWxsxprFNaku2nEJ5v4XWrim2WqmZHDpg=; b=oSaccfZdFyLdXBIeBAG6Ibot2q6EgMd87UYNw5Lat6g69oxF4SDGoOnOyEF0HuZHxQ gE0O5NWbkwwTjUbwJMaGpKt0b542FhE2BaVBN6n5uPPy3i18IhFVqK+1qkXGVRRnRYmT EHw4kPUBga7a62an1fSWCGud60lqd54K1c9g3NgJElgNBI9VcZFiCJshTFo7zI9501P6 G9voRHKeg3nHRxeLPCSzshmpquMXQLvxkpQy+8fYkx4wf4fSJkNLdhoLlrTcTxbO2fgF 
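The huge-fault handlers above pick a mapping level from the dmem page size and mask the faulting address down to that level's boundary. A stand-alone sketch of that selection with assumed x86-64 constants (2M PMD, 1G PUD); the real handlers return VM_FAULT_FALLBACK instead of silently picking a smaller level:

/* Model of the size check / address masking in the dmemfs fault paths. */
#include <stdio.h>

#define PMD_SHIFT 21
#define PUD_SHIFT 30
#define PMD_SIZE  (1UL << PMD_SHIFT)
#define PUD_SIZE  (1UL << PUD_SHIFT)
#define PUD_MASK  (~(PUD_SIZE - 1))

static const char *fault_level(unsigned long dmem_page_size)
{
	if (dmem_page_size >= PUD_SIZE)
		return "pud";   /* 1G dmem pages: map with a huge PUD  */
	if (dmem_page_size >= PMD_SIZE)
		return "pmd";   /* 2M dmem pages: map with a huge PMD  */
	return "pte";           /* 4K dmem pages: ordinary PTE fault   */
}

int main(void)
{
	unsigned long addr = 0x7f76dead0000UL;

	printf("pud_addr=%#lx level=%s\n", addr & PUD_MASK, fault_level(1UL << 30));
	return 0;
}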
HU8qWIe57oTw2lhk8sdPJmOjhWm/tkdP74aZMEQaxPWahTqYSqObK18D+Lbtq+Rrxhlj C9Rg== X-Gm-Message-State: AOAM532ANJSZ01dg9THtSdDHSz00u82/ab0xhfHfILYwAawxH1uC7Auj EC35AGPxPMtee6CokdVLTAI= X-Google-Smtp-Source: ABdhPJwyNYWW30gHq9pOZotI8O5TvgnyfvtFboa27PYJJsWjcCMUClmt5pZogoHFcS4MUOgwfk1x+Q== X-Received: by 2002:a63:7f03:: with SMTP id a3mr17930488pgd.313.1607340918580; Mon, 07 Dec 2020 03:35:18 -0800 (PST) Received: from localhost.localdomain ([203.205.141.39]) by smtp.gmail.com with ESMTPSA id d4sm14219822pfo.127.2020.12.07.03.35.15 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Mon, 07 Dec 2020 03:35:18 -0800 (PST) From: yulei.kernel@gmail.com X-Google-Original-From: yuleixzhang@tencent.com To: linux-mm@kvack.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com Cc: joao.m.martins@oracle.com, rdunlap@infradead.org, sean.j.christopherson@intel.com, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang , Chen Zhuo Subject: [RFC V2 29/37] mm: add follow_pte_pud() to support huge pud look up Date: Mon, 7 Dec 2020 19:31:22 +0800 Message-Id: <43e14d8a452789e5321b93810a50acfe95672e99.1607332046.git.yuleixzhang@tencent.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org From: Yulei Zhang Since we had supported dmem huge pud, here support dmem huge pud for hva_to_pfn(). Similar to follow_pte_pmd(), follow_pte_pud() allows a PTE lead or a huge page PMD or huge page PUD to be found and returned. Signed-off-by: Chen Zhuo Signed-off-by: Yulei Zhang --- mm/memory.c | 52 ++++++++++++++++++++++++++++++++++++++++++++-------- 1 file changed, 44 insertions(+), 8 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 01f3b05..dfc95be 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4698,9 +4698,9 @@ int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address) } #endif /* __PAGETABLE_PMD_FOLDED */ -static int __follow_pte_pmd(struct mm_struct *mm, unsigned long address, +static int __follow_pte_pud(struct mm_struct *mm, unsigned long address, struct mmu_notifier_range *range, - pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp) + pte_t **ptepp, pmd_t **pmdpp, pud_t **pudpp, spinlock_t **ptlp) { pgd_t *pgd; p4d_t *p4d; @@ -4717,6 +4717,26 @@ static int __follow_pte_pmd(struct mm_struct *mm, unsigned long address, goto out; pud = pud_offset(p4d, address); + VM_BUG_ON(pud_trans_huge(*pud)); + if (pud_huge(*pud)) { + if (!pudpp) + goto out; + + if (range) { + mmu_notifier_range_init(range, MMU_NOTIFY_CLEAR, 0, + NULL, mm, address & PUD_MASK, + (address & PUD_MASK) + PUD_SIZE); + mmu_notifier_invalidate_range_start(range); + } + *ptlp = pud_lock(mm, pud); + if (pud_huge(*pud)) { + *pudpp = pud; + return 0; + } + spin_unlock(*ptlp); + if (range) + mmu_notifier_invalidate_range_end(range); + } if (pud_none(*pud) || unlikely(pud_bad(*pud))) goto out; @@ -4772,8 +4792,8 @@ static inline int follow_pte(struct mm_struct *mm, unsigned long address, /* (void) is needed to make gcc happy */ (void) __cond_lock(*ptlp, - !(res = __follow_pte_pmd(mm, address, NULL, - ptepp, NULL, ptlp))); + !(res = __follow_pte_pud(mm, address, NULL, + ptepp, NULL, NULL, ptlp))); return res; } @@ -4785,12 +4805,24 @@ int follow_pte_pmd(struct mm_struct *mm, unsigned long address, /* (void) is needed to make gcc happy */ (void) __cond_lock(*ptlp, - !(res = 
__follow_pte_pmd(mm, address, range, - ptepp, pmdpp, ptlp))); + !(res = __follow_pte_pud(mm, address, range, + ptepp, pmdpp, NULL, ptlp))); return res; } EXPORT_SYMBOL(follow_pte_pmd); +int follow_pte_pud(struct mm_struct *mm, unsigned long address, + struct mmu_notifier_range *range, + pte_t **ptepp, pmd_t **pmdpp, pud_t **pudpp, spinlock_t **ptlp) +{ + int res; + + /* (void) is needed to make gcc happy */ + (void) __cond_lock(*ptlp, + !(res = __follow_pte_pud(mm, address, range, + ptepp, pmdpp, pudpp, ptlp))); + return res; +} /** * follow_pfn - look up PFN at a user virtual address * @vma: memory mapping @@ -4808,15 +4840,19 @@ int follow_pfn(struct vm_area_struct *vma, unsigned long address, spinlock_t *ptl; pte_t *ptep; pmd_t *pmdp = NULL; + pud_t *pudp = NULL; if (!(vma->vm_flags & (VM_IO | VM_PFNMAP))) return ret; - ret = follow_pte_pmd(vma->vm_mm, address, NULL, &ptep, &pmdp, &ptl); + ret = follow_pte_pud(vma->vm_mm, address, NULL, &ptep, &pmdp, &pudp, &ptl); if (ret) return ret; - if (pmdp) { + if (pudp) { + *pfn = pud_pfn(*pudp) + ((address & ~PUD_MASK) >> PAGE_SHIFT); + spin_unlock(ptl); + } else if (pmdp) { *pfn = pmd_pfn(*pmdp) + ((address & ~PMD_MASK) >> PAGE_SHIFT); spin_unlock(ptl); } else { From patchwork Mon Dec 7 11:31:23 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11955495 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-15.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id F228AC1B0E3 for ; Mon, 7 Dec 2020 11:37:01 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id BCD2E233A1 for ; Mon, 7 Dec 2020 11:37:01 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727457AbgLGLfi (ORCPT ); Mon, 7 Dec 2020 06:35:38 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41012 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727461AbgLGLfh (ORCPT ); Mon, 7 Dec 2020 06:35:37 -0500 Received: from mail-pg1-x544.google.com (mail-pg1-x544.google.com [IPv6:2607:f8b0:4864:20::544]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 32DBFC0613D4; Mon, 7 Dec 2020 03:35:22 -0800 (PST) Received: by mail-pg1-x544.google.com with SMTP id 69so806577pgg.8; Mon, 07 Dec 2020 03:35:22 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=amXRNC1mfdywzpWrbOff714/vb9YHjvjsXiNQKXcqcA=; b=k6DezM51fJayfcLxJQektp8YJvDSsBbz0aVlfSnHNBt+2Wxn4RS9e6sNgSGv12LZdn fScIMhUj9dLNh7Arm9VHs4vEpVXoSPwObS0+Csb8MMe9MYzmH6dg0lqwqyG6nAde1wOi 34bTopBZUATvrIHVmW1MmgCkKXfJgTCxl5kBjmRFwuXuFss9pzdJv061rMlRT0sR6cvm QdbctetT25eotvN/DrxAcuWPDaSg3ZFzoyaDSMKXEBdg8nUVJsjJWPK3XNCksnNUGHkB SEu+RmRnibNi7ublI6HydqMUjgwWBf6MWZu/qay+0SGsOUxJkYzQEZGovb73STeCAsJh 9Bug== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; 
h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=amXRNC1mfdywzpWrbOff714/vb9YHjvjsXiNQKXcqcA=; b=Z7/2AO6O4HHXhClBz5MfEczfyWJ/xVkgc6QDHctbpG3YN+RK9Sw6iUlBS4cpajposo evRzf1NYWJTtLR5Lk+yy67QbZu17yYm527/rMeuIFZbC3FcpkcfY4eNoE9BAVtvuxtu3 flI0SnJWhSZ8ZDplsijlXSU4WgI4lyoEROUgrvQ0jCLUhlREJRhD8Wl016vN96C+gwSd IP3RQZ5y50AHjYw4fepQApVGugMIBdYEZkaUvkpgBtLKuCMa2LXJjEn0B1/eiZs0HKvP D4guH+Z2vNW5tPRWdijKxmfnx/ojwBhfS7ZB81SfYNgKept/S6zhQnw2xW2KcoGYSnl4 A/qw== X-Gm-Message-State: AOAM530FU821JmRsnawHHOa1cvzA+FwqDhSgHwsuzf4lLOjdRtC2i1FM XT60dtx3qK9+mei9iWl7oZg= X-Google-Smtp-Source: ABdhPJzktfbAut0a4U/9/77Kk3IPuy2wYdqsh6ukZ0LN08xk9L3dyRMq6PVyY4Nz0UXg4U/HCNjWVQ== X-Received: by 2002:a17:902:b101:b029:da:c50e:cd56 with SMTP id q1-20020a170902b101b02900dac50ecd56mr15820085plr.59.1607340921829; Mon, 07 Dec 2020 03:35:21 -0800 (PST) Received: from localhost.localdomain ([203.205.141.39]) by smtp.gmail.com with ESMTPSA id d4sm14219822pfo.127.2020.12.07.03.35.18 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Mon, 07 Dec 2020 03:35:21 -0800 (PST) From: yulei.kernel@gmail.com X-Google-Original-From: yuleixzhang@tencent.com To: linux-mm@kvack.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com Cc: joao.m.martins@oracle.com, rdunlap@infradead.org, sean.j.christopherson@intel.com, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang , Chen Zhuo Subject: [RFC V2 30/37] dmem: introduce dmem_bitmap_alloc() and dmem_bitmap_free() Date: Mon, 7 Dec 2020 19:31:23 +0800 Message-Id: <6eca6b9b58b3cf9a52c8227ee92d9b926c249f0b.1607332046.git.yuleixzhang@tencent.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org From: Yulei Zhang If dmem contained in dmem region is too large and dmemfs is mounted as 4K pagesize, size of bitmap in this dmem region maybe exceed maximal available memory of kzalloc(). It would cause kzalloc() fail. So introduce dmem_bitmap_alloc() and use vzalloc() if bitmap is larger than PAGE_SIZE as vzalloc() will get sparse page. 
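For a sense of the limit the changelog refers to (illustrative figures, not taken from the patch): a 1 TiB dmem region mounted with a 4 KiB page size needs 2^40 / 2^12 = 2^28 bitmap bits, i.e. 32 MiB of bitmap. kzalloc() is backed by the buddy allocator and capped at KMALLOC_MAX_SIZE (typically 4 MiB on x86-64), so a bitmap of that size cannot come from kzalloc(); vzalloc() assembles the buffer from individually allocated pages and only needs virtually contiguous space, which is why it is used once the bitmap grows beyond PAGE_SIZE.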
Signed-off-by: Chen Zhuo Signed-off-by: Yulei Zhang --- fs/inode.c | 6 +++++ include/linux/fs.h | 1 + mm/dmem.c | 69 ++++++++++++++++++++++++++++++++++-------------------- 3 files changed, 50 insertions(+), 26 deletions(-) diff --git a/fs/inode.c b/fs/inode.c index 9d78c37..9b6363d3 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -210,6 +210,12 @@ int inode_init_always(struct super_block *sb, struct inode *inode) } EXPORT_SYMBOL(inode_init_always); +struct inode *alloc_inode_nonrcu(void) +{ + return kmem_cache_alloc(inode_cachep, GFP_KERNEL); +} +EXPORT_SYMBOL(alloc_inode_nonrcu); + void free_inode_nonrcu(struct inode *inode) { kmem_cache_free(inode_cachep, inode); diff --git a/include/linux/fs.h b/include/linux/fs.h index 8667d0c..bc7a89c 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2937,6 +2937,7 @@ static inline bool is_zero_ino(ino_t ino) extern void __destroy_inode(struct inode *); extern struct inode *new_inode_pseudo(struct super_block *sb); extern struct inode *new_inode(struct super_block *sb); +extern struct inode *alloc_inode_nonrcu(void); extern void free_inode_nonrcu(struct inode *inode); extern int should_remove_suid(struct dentry *); extern int file_remove_privs(struct file *); diff --git a/mm/dmem.c b/mm/dmem.c index eb6df70..50cdff9 100644 --- a/mm/dmem.c +++ b/mm/dmem.c @@ -17,6 +17,7 @@ #include #include #include +#include #define CREATE_TRACE_POINTS #include @@ -362,9 +363,38 @@ static int __init dmem_node_init(struct dmem_node *dnode) return 0; } +static unsigned long *dmem_bitmap_alloc(unsigned long pages, + unsigned long *static_bitmap) +{ + unsigned long *bitmap, size; + + size = BITS_TO_LONGS(pages) * sizeof(long); + if (size <= sizeof(*static_bitmap)) + bitmap = static_bitmap; + else if (size <= PAGE_SIZE) + bitmap = kzalloc(size, GFP_KERNEL); + else + bitmap = vzalloc(size); + + return bitmap; +} + +static void dmem_bitmap_free(unsigned long pages, + unsigned long *bitmap, + unsigned long *static_bitmap) +{ + unsigned long size; + + size = BITS_TO_LONGS(pages) * sizeof(long); + if (size > PAGE_SIZE) + vfree(bitmap); + else if (bitmap != static_bitmap) + kfree(bitmap); +} + static void __init dmem_region_uinit(struct dmem_region *dregion) { - unsigned long nr_pages, size, *bitmap = dregion->error_bitmap; + unsigned long nr_pages, *bitmap = dregion->error_bitmap; if (!bitmap) return; @@ -374,9 +404,7 @@ static void __init dmem_region_uinit(struct dmem_region *dregion) WARN_ON(!nr_pages); - size = BITS_TO_LONGS(nr_pages) * sizeof(long); - if (size > sizeof(dregion->static_bitmap)) - kfree(bitmap); + dmem_bitmap_free(nr_pages, bitmap, &dregion->static_error_bitmap); dregion->error_bitmap = NULL; } @@ -405,19 +433,15 @@ static void __init dmem_uinit(void) static int __init dmem_region_init(struct dmem_region *dregion) { - unsigned long *bitmap, size, nr_pages; + unsigned long *bitmap, nr_pages; nr_pages = __phys_to_pfn(dregion->reserved_end_addr) - __phys_to_pfn(dregion->reserved_start_addr); - size = BITS_TO_LONGS(nr_pages) * sizeof(long); - if (size <= sizeof(dregion->static_error_bitmap)) { - bitmap = &dregion->static_error_bitmap; - } else { - bitmap = kzalloc(size, GFP_KERNEL); - if (!bitmap) - return -ENOMEM; - } + bitmap = dmem_bitmap_alloc(nr_pages, &dregion->static_error_bitmap); + if (!bitmap) + return -ENOMEM; + dregion->error_bitmap = bitmap; return 0; } @@ -472,7 +496,7 @@ static int __init dmem_late_init(void) static int dmem_alloc_region_init(struct dmem_region *dregion, unsigned long *dpages) { - unsigned long start, end, *bitmap, size; + 
unsigned long start, end, *bitmap; start = DMEM_PAGE_UP(dregion->reserved_start_addr); end = DMEM_PAGE_DOWN(dregion->reserved_end_addr); @@ -481,14 +505,9 @@ static int dmem_alloc_region_init(struct dmem_region *dregion, if (!*dpages) return 0; - size = BITS_TO_LONGS(*dpages) * sizeof(long); - if (size <= sizeof(dregion->static_bitmap)) - bitmap = &dregion->static_bitmap; - else { - bitmap = kzalloc(size, GFP_KERNEL); - if (!bitmap) - return -ENOMEM; - } + bitmap = dmem_bitmap_alloc(*dpages, &dregion->static_bitmap); + if (!bitmap) + return -ENOMEM; dregion->bitmap = bitmap; dregion->next_free_pos = 0; @@ -582,7 +601,7 @@ static void dmem_uinit_check_alloc_bitmap(struct dmem_region *dregion) static void dmem_alloc_region_uinit(struct dmem_region *dregion) { - unsigned long dpages, size, *bitmap = dregion->bitmap; + unsigned long dpages, *bitmap = dregion->bitmap; if (!bitmap) return; @@ -592,9 +611,7 @@ static void dmem_alloc_region_uinit(struct dmem_region *dregion) dmem_uinit_check_alloc_bitmap(dregion); - size = BITS_TO_LONGS(dpages) * sizeof(long); - if (size > sizeof(dregion->static_bitmap)) - kfree(bitmap); + dmem_bitmap_free(dpages, bitmap, &dregion->static_bitmap); dregion->bitmap = NULL; } From patchwork Mon Dec 7 11:31:24 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11955481 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-15.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 077DDC1B0E3 for ; Mon, 7 Dec 2020 11:36:26 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id D1D73233A1 for ; Mon, 7 Dec 2020 11:36:25 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727518AbgLGLfs (ORCPT ); Mon, 7 Dec 2020 06:35:48 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41030 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726820AbgLGLfq (ORCPT ); Mon, 7 Dec 2020 06:35:46 -0500 Received: from mail-pj1-x1041.google.com (mail-pj1-x1041.google.com [IPv6:2607:f8b0:4864:20::1041]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id B1203C0613D3; Mon, 7 Dec 2020 03:35:25 -0800 (PST) Received: by mail-pj1-x1041.google.com with SMTP id lb18so4650807pjb.5; Mon, 07 Dec 2020 03:35:25 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=RqR44fauEeBZpjBAC1kdf97KbTscq65KeJ9aEDRCpGI=; b=Rw3amSd3rBlJpXc7dX1ne8f178KphM/6QUoRTeExFUzM83n8/SHLhQ1jf5VnNMZZoq WuAjjkZMPHqpQpdQvXvFrk1Lrq3tYST8SBgJNrKuHm5OX6smgd6BEFs4OIMr7YERjyHE +l8qRaSkfTcRGuTg4cWZJ7rhUL/vn+7/MfHju29iuvNwlvUzvEEXuWed5lRO12cXlsAx XrCYPpBkWM0g4QcxIQmTZCq99AtRHUV0xUPFWj3Jr2NdsV9RrOanCNVSLbsuvivzchZf +SeAeVSMNE20dUY9LBSn+wbwDwf7toSjY/933D0g8CyRml2C5K4r/casmfAA+xjS1HuQ ICHg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; 
h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=RqR44fauEeBZpjBAC1kdf97KbTscq65KeJ9aEDRCpGI=; b=GYeU0aeGM94M3BKOhjFYyyKOPLOtl2g+p721sZhD7luQywZ8S3FyP8A9AAMIoxF1/V W/8b1zm+RuHsoRfpxEhp5ZVA+DY+M5WmGGnKT7bOhSZKqmlDFzdebNvFVUCckY5rufmW o4+VviV+0/HsyVxcyWyFbPZaXXR1S5X4iBPOtdE0n9YJ3bdBMOxepGcMH+OEPz0e/bpf +lcnXUW/Tw/idGPz7xN3+6cdFMPl/jkZgFu0YlVS0/u9Altm8ex3aQ270mRy6eyUs58E MxXffbvH0J6aXTejm5VcOsdC6EkaAY2tnPfqmWxfhTuMvcRZ9dwD964V+LCBKDvm6tzR t1gA== X-Gm-Message-State: AOAM530moruTCe4XwzR4jVkKr5xbFsF2PoCbHnQS1NwWghXxjY12PP2J jMnDNcbbFXyDvdopwx669FbPyk+l0bc= X-Google-Smtp-Source: ABdhPJyk3zQTpnyvz7bT7T3Uf0mnoGZKCzyksipWfZnA8cr+UkGx6yntGVpICOae5nxRYw96Y20SUg== X-Received: by 2002:a17:90b:11d5:: with SMTP id gv21mr930902pjb.12.1607340925289; Mon, 07 Dec 2020 03:35:25 -0800 (PST) Received: from localhost.localdomain ([203.205.141.39]) by smtp.gmail.com with ESMTPSA id d4sm14219822pfo.127.2020.12.07.03.35.22 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Mon, 07 Dec 2020 03:35:24 -0800 (PST) From: yulei.kernel@gmail.com X-Google-Original-From: yuleixzhang@tencent.com To: linux-mm@kvack.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com Cc: joao.m.martins@oracle.com, rdunlap@infradead.org, sean.j.christopherson@intel.com, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang , Haiwei Li Subject: [RFC V2 31/37] dmem: introduce mce handler Date: Mon, 7 Dec 2020 19:31:24 +0800 Message-Id: <6a5471107b81ee999f776547f2fccb045967701e.1607332046.git.yuleixzhang@tencent.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org From: Yulei Zhang dmem handle the mce if the pfn belongs to dmem when mce occurs. 1. check whether the pfn is handled by dmem. return if true. 2. mark the pfn in a new error bitmap defined in page. 3. a series of mechanism to ensure that the mce pfn is not allocated. 
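A minimal user-space model of steps 1-3 may make the bookkeeping easier to follow; the names, region geometry and messages below are illustrative assumptions, not the kernel implementation (which works on dregion->error_bitmap and dregion->bitmap under dmem_pool.lock):

#include <stdbool.h>
#include <stdio.h>

#define NR_PFNS        64          /* base 4K pages covered by the region (model) */
#define PFNS_PER_DPAGE  4          /* dpage size / base page size (model)         */
#define NR_DPAGES      (NR_PFNS / PFNS_PER_DPAGE)

static unsigned char error_bitmap[NR_PFNS];   /* one entry per poisoned pfn    */
static unsigned char alloc_bitmap[NR_DPAGES]; /* one entry per allocated dpage */

/* Steps 1+2: record the poisoned pfn.  Step 3: make sure the covering dpage
 * can never be handed out again by marking it in the allocation bitmap.      */
static bool model_memory_failure(unsigned long pfn)
{
	unsigned long dpage = pfn / PFNS_PER_DPAGE;

	if (pfn >= NR_PFNS)
		return false;             /* not dmem: let core memory_failure() handle it */
	if (error_bitmap[pfn])
		return true;              /* already recorded                              */
	error_bitmap[pfn] = 1;

	if (alloc_bitmap[dpage])
		printf("dpage %lu is in use, pfn %lu poisoned while mapped\n", dpage, pfn);
	else
		alloc_bitmap[dpage] = 1;  /* free dpage: silently take it out of the pool  */
	return true;
}

int main(void)
{
	model_memory_failure(9);          /* free dpage -> disabled for future allocation */
	alloc_bitmap[3] = 1;              /* pretend dpage 3 was already allocated        */
	model_memory_failure(13);         /* in-use dpage -> patch 32 will signal mappers */
	return 0;
}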
Signed-off-by: Haiwei Li Signed-off-by: Yulei Zhang --- include/linux/dmem.h | 6 +++ include/trace/events/dmem.h | 17 ++++++++ mm/dmem.c | 103 +++++++++++++++++++++++++++++++------------- mm/memory-failure.c | 6 +++ 4 files changed, 102 insertions(+), 30 deletions(-) diff --git a/include/linux/dmem.h b/include/linux/dmem.h index 59d3ef14..cd17a91 100644 --- a/include/linux/dmem.h +++ b/include/linux/dmem.h @@ -21,6 +21,8 @@ void dmem_free_pages(phys_addr_t addr, unsigned int dpages_nr); bool is_dmem_pfn(unsigned long pfn); #define dmem_free_page(addr) dmem_free_pages(addr, 1) + +bool dmem_memory_failure(unsigned long pfn, int flags); #else static inline int dmem_reserve_init(void) { @@ -32,5 +34,9 @@ static inline bool is_dmem_pfn(unsigned long pfn) return 0; } +static inline bool dmem_memory_failure(unsigned long pfn, int flags) +{ + return false; +} #endif #endif /* _LINUX_DMEM_H */ diff --git a/include/trace/events/dmem.h b/include/trace/events/dmem.h index 10d1b90..f8eeb3c 100644 --- a/include/trace/events/dmem.h +++ b/include/trace/events/dmem.h @@ -62,6 +62,23 @@ TP_printk("addr %#lx dpages_nr %d", (unsigned long)__entry->addr, __entry->dpages_nr) ); + +TRACE_EVENT(dmem_memory_failure, + TP_PROTO(unsigned long pfn, bool used), + TP_ARGS(pfn, used), + + TP_STRUCT__entry( + __field(unsigned long, pfn) + __field(bool, used) + ), + + TP_fast_assign( + __entry->pfn = pfn; + __entry->used = used; + ), + + TP_printk("pfn=%#lx used=%d", __entry->pfn, __entry->used) +); #endif /* This part must be outside protection */ diff --git a/mm/dmem.c b/mm/dmem.c index 50cdff9..16438db 100644 --- a/mm/dmem.c +++ b/mm/dmem.c @@ -431,6 +431,41 @@ static void __init dmem_uinit(void) dmem_pool.registered_pages = 0; } +/* set or clear corresponding bit on allocation bitmap based on error bitmap */ +static unsigned long dregion_alloc_bitmap_set_clear(struct dmem_region *dregion, + bool set) +{ + unsigned long pos_pfn, pos_offset; + unsigned long valid_pages, mce_dpages = 0; + phys_addr_t dpage, reserved_start_pfn; + + reserved_start_pfn = __phys_to_pfn(dregion->reserved_start_addr); + + valid_pages = dpage_to_pfn(dregion->dpage_end_pfn) - reserved_start_pfn; + pos_offset = dpage_to_pfn(dregion->dpage_start_pfn) + - reserved_start_pfn; +try_set: + pos_pfn = find_next_bit(dregion->error_bitmap, valid_pages, pos_offset); + + if (pos_pfn >= valid_pages) + return mce_dpages; + mce_dpages++; + dpage = pfn_to_dpage(pos_pfn + reserved_start_pfn); + if (set) + WARN_ON(__test_and_set_bit(dpage - dregion->dpage_start_pfn, + dregion->bitmap)); + else + WARN_ON(!__test_and_clear_bit(dpage - dregion->dpage_start_pfn, + dregion->bitmap)); + pos_offset = dpage_to_pfn(dpage + 1) - reserved_start_pfn; + goto try_set; +} + +static unsigned long dmem_region_mark_mce_dpages(struct dmem_region *dregion) +{ + return dregion_alloc_bitmap_set_clear(dregion, true); +} + static int __init dmem_region_init(struct dmem_region *dregion) { unsigned long *bitmap, nr_pages; @@ -514,6 +549,8 @@ static int dmem_alloc_region_init(struct dmem_region *dregion, dregion->dpage_start_pfn = start; dregion->dpage_end_pfn = end; + *dpages -= dmem_region_mark_mce_dpages(dregion); + dmem_pool.unaligned_pages += __phys_to_pfn((dpage_to_phys(start) - dregion->reserved_start_addr)); dmem_pool.unaligned_pages += __phys_to_pfn(dregion->reserved_end_addr @@ -558,36 +595,6 @@ static bool dmem_dpage_is_error(struct dmem_region *dregion, phys_addr_t dpage) return err_num; } -/* set or clear corresponding bit on allocation bitmap based on error bitmap */ -static 
unsigned long dregion_alloc_bitmap_set_clear(struct dmem_region *dregion, - bool set) -{ - unsigned long pos_pfn, pos_offset; - unsigned long valid_pages, mce_dpages = 0; - phys_addr_t dpage, reserved_start_pfn; - - reserved_start_pfn = __phys_to_pfn(dregion->reserved_start_addr); - - valid_pages = dpage_to_pfn(dregion->dpage_end_pfn) - reserved_start_pfn; - pos_offset = dpage_to_pfn(dregion->dpage_start_pfn) - - reserved_start_pfn; -try_set: - pos_pfn = find_next_bit(dregion->error_bitmap, valid_pages, pos_offset); - - if (pos_pfn >= valid_pages) - return mce_dpages; - mce_dpages++; - dpage = pfn_to_dpage(pos_pfn + reserved_start_pfn); - if (set) - WARN_ON(__test_and_set_bit(dpage - dregion->dpage_start_pfn, - dregion->bitmap)); - else - WARN_ON(!__test_and_clear_bit(dpage - dregion->dpage_start_pfn, - dregion->bitmap)); - pos_offset = dpage_to_pfn(dpage + 1) - reserved_start_pfn; - goto try_set; -} - static void dmem_uinit_check_alloc_bitmap(struct dmem_region *dregion) { unsigned long dpages, size; @@ -989,6 +996,42 @@ void dmem_free_pages(phys_addr_t addr, unsigned int dpages_nr) } EXPORT_SYMBOL(dmem_free_pages); +bool dmem_memory_failure(unsigned long pfn, int flags) +{ + struct dmem_region *dregion; + struct dmem_node *pdnode = NULL; + u64 pos; + phys_addr_t addr = __pfn_to_phys(pfn); + bool used = false; + + dregion = find_dmem_region(addr, &pdnode); + if (!dregion) + return false; + + WARN_ON(!pdnode || !dregion->error_bitmap); + + mutex_lock(&dmem_pool.lock); + pos = pfn - __phys_to_pfn(dregion->reserved_start_addr); + if (__test_and_set_bit(pos, dregion->error_bitmap)) + goto out; + + if (!dregion->bitmap || pfn < dpage_to_pfn(dregion->dpage_start_pfn) || + pfn >= dpage_to_pfn(dregion->dpage_end_pfn)) + goto out; + + pos = phys_to_dpage(addr) - dregion->dpage_start_pfn; + if (__test_and_set_bit(pos, dregion->bitmap)) { + used = true; + } else { + pr_info("MCE: free dpage, mark %#lx disabled in dmem\n", pfn); + dnode_count_free_dpages(pdnode, -1); + } +out: + trace_dmem_memory_failure(pfn, used); + mutex_unlock(&dmem_pool.lock); + return true; +} + bool is_dmem_pfn(unsigned long pfn) { struct dmem_node *dnode; diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 5d880d4..dda45d2 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -35,6 +35,7 @@ */ #include #include +#include #include #include #include @@ -1323,6 +1324,11 @@ int memory_failure(unsigned long pfn, int flags) if (!sysctl_memory_failure_recovery) panic("Memory failure on page %lx", pfn); + if (dmem_memory_failure(pfn, flags)) { + pr_info("MCE %#lx: handled by dmem\n", pfn); + return 0; + } + p = pfn_to_online_page(pfn); if (!p) { if (pfn_valid(pfn)) { From patchwork Mon Dec 7 11:31:25 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11955487 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-15.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id EBFE5C4361B for ; Mon, 7 Dec 2020 11:37:00 +0000 (UTC) Received: from vger.kernel.org 
(vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id AA5A92333F for ; Mon, 7 Dec 2020 11:37:00 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727511AbgLGLfs (ORCPT ); Mon, 7 Dec 2020 06:35:48 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40936 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727502AbgLGLfo (ORCPT ); Mon, 7 Dec 2020 06:35:44 -0500 Received: from mail-pj1-x1043.google.com (mail-pj1-x1043.google.com [IPv6:2607:f8b0:4864:20::1043]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 27407C0613D0; Mon, 7 Dec 2020 03:35:29 -0800 (PST) Received: by mail-pj1-x1043.google.com with SMTP id h7so6860858pjk.1; Mon, 07 Dec 2020 03:35:29 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=nVWSYdYM+m3VgpFdUaNYp2nPK15v+oLFac3l4i+Xkck=; b=s62bw1Ds5PE8L4XzidHHxWO7dowi92tTp2jvyxNTtrK0syogpxYuEVLqaYD0RVv631 owhJ9XHEiEAUC9LLWjMTm9Pe+rm9eGDdeJHWBiBJ6IybF16oZrZVOSyni6mDTkHbJNtj lV1WDgxMDJ4vPzpOm/VOel/dhx+0n4F5S7mRxVlZDCS15ym9YM5QeoViAugifkeeSQ6P OfHGghB4s0mBvhs1ZJML1PDJYbX13uJtxvR79+agbDqhEkF5cdyciPfPgh9PVVRxkp45 i3OyRnwn9i+vu3ayrr5iEDIPr3IawOVrT3n6PBt+AqhUVLQRTIcKvdEj11xd5/VzMArO ygQg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=nVWSYdYM+m3VgpFdUaNYp2nPK15v+oLFac3l4i+Xkck=; b=MaYTHrXpbwnyGD9+EGn7u+2dPbNGtRjPJR3qQR3OvKaQwBkmV8JiuhMdC+5Kyu8v2W QlmV5okehYpdxziN/OCV89Dg0D/Qbm4hKzUvDbDHXeJ+EckPFBWj/ZTzdZx+ZFzmyY8y GaesL4X0pOMyy5WcGvLnpkYRE1kzJYVPMYeBnmgF1osQIss/Cz/6rIgF3F4Q0iSFAJ4b 5p507f8l8wF+c4LhKs01lNjEwSAPfAZJ153E6G2uaKU8tGhc4fOWkZESaRsQAixehkcY sGYrppzR8UcDtEB8oo3OlCYcPgLc6u9b1LtsANPGLNww7TTSsu+qbChxE7O93AJ9hVkZ QfFg== X-Gm-Message-State: AOAM530niNai0um2vqyQIXT0UwhwYf/Fnvmp//XIhm3wLrQuj8UYbhoP TkyXiD6J519r1xevAc6Gf4A= X-Google-Smtp-Source: ABdhPJxbhoVS4O5do45wTEXma6Md3j6oREnROj3xA+nTQrYjV+3OjcYJRxFen4GD2PNOYOwIlCI0hQ== X-Received: by 2002:a17:90a:e005:: with SMTP id u5mr15885574pjy.64.1607340928636; Mon, 07 Dec 2020 03:35:28 -0800 (PST) Received: from localhost.localdomain ([203.205.141.39]) by smtp.gmail.com with ESMTPSA id d4sm14219822pfo.127.2020.12.07.03.35.25 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Mon, 07 Dec 2020 03:35:28 -0800 (PST) From: yulei.kernel@gmail.com X-Google-Original-From: yuleixzhang@tencent.com To: linux-mm@kvack.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com Cc: joao.m.martins@oracle.com, rdunlap@infradead.org, sean.j.christopherson@intel.com, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang , Haiwei Li Subject: [RFC V2 32/37] mm, dmemfs: register and handle the dmem mce Date: Mon, 7 Dec 2020 19:31:25 +0800 Message-Id: <2c95c5ed91e84229a234d243b8660e1b9cab8bbd.1607332046.git.yuleixzhang@tencent.com> X-Mailer: git-send-email 2.28.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org From: Yulei Zhang dmemfs register the mce handler, send signal to the procs whose vma is mapped in mce pfn. 
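The hand-off between mm/dmem.c and dmemfs added here follows the usual notifier-chain pattern; the fragment below is a stand-alone user-space stand-in for that flow (callback, structure and messages are illustrative assumptions, not the kernel API -- the real code registers dmemfs_mce_notifier on dmem_pool.mce_notifier_chain and resolves the pfn to an inode before calling collect_procs_and_signal_inode()):

#include <stdio.h>

struct mce_info { int flags; };
typedef int (*mce_cb)(unsigned long pfn, struct mce_info *info);

static mce_cb chain[4];                 /* toy notifier chain */
static int nr_cb;

static int register_mce_notifier(mce_cb cb)
{
	if (nr_cb >= 4)
		return -1;
	chain[nr_cb++] = cb;
	return 0;
}

static void mce_notify(unsigned long pfn, struct mce_info *info)
{
	for (int i = 0; i < nr_cb; i++)     /* dmem core: publish the poisoned pfn */
		chain[i](pfn, info);
}

/* dmemfs side: map pfn -> (inode, pgoff), then signal the tasks mapping it. */
static int dmemfs_mce_cb(unsigned long pfn, struct mce_info *info)
{
	printf("pfn %#lx: find owning inode, collect mapping tasks, send SIGBUS (flags=%d)\n",
	       pfn, info->flags);
	return 0;
}

int main(void)
{
	struct mce_info info = { .flags = 0 };

	register_mce_notifier(dmemfs_mce_cb);  /* done once at dmemfs_init()            */
	mce_notify(0x1234, &info);             /* dmem_memory_failure() on a used dpage */
	return 0;
}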
Signed-off-by: Haiwei Li Signed-off-by: Yulei Zhang --- fs/dmemfs/inode.c | 141 +++++++++++++++++++++++++++++++++++++++++++++++++++ include/linux/dmem.h | 7 +++ include/linux/mm.h | 2 + mm/dmem.c | 34 +++++++++++++ mm/memory-failure.c | 64 ++++++++++++++++------- 5 files changed, 231 insertions(+), 17 deletions(-) diff --git a/fs/dmemfs/inode.c b/fs/dmemfs/inode.c index f698b9d..4303bcdc 100644 --- a/fs/dmemfs/inode.c +++ b/fs/dmemfs/inode.c @@ -36,6 +36,47 @@ static uint __read_mostly max_alloc_try_dpages = 1; +struct dmemfs_inode { + struct inode *inode; + struct list_head link; +}; + +static LIST_HEAD(dmemfs_inode_list); +static DEFINE_SPINLOCK(dmemfs_inode_lock); + +static struct dmemfs_inode * +dmemfs_create_dmemfs_inode(struct inode *inode) +{ + struct dmemfs_inode *dmemfs_inode; + + spin_lock(&dmemfs_inode_lock); + dmemfs_inode = kmalloc(sizeof(struct dmemfs_inode), GFP_NOIO); + if (!dmemfs_inode) { + pr_err("DMEMFS: Out of memory while getting dmemfs inode\n"); + goto out; + } + dmemfs_inode->inode = inode; + list_add_tail(&dmemfs_inode->link, &dmemfs_inode_list); +out: + spin_unlock(&dmemfs_inode_lock); + return dmemfs_inode; +} + +static void dmemfs_delete_dmemfs_inode(struct inode *inode) +{ + struct dmemfs_inode *i, *next; + + spin_lock(&dmemfs_inode_lock); + list_for_each_entry_safe(i, next, &dmemfs_inode_list, link) { + if (i->inode == inode) { + list_del(&i->link); + kfree(i); + break; + } + } + spin_unlock(&dmemfs_inode_lock); +} + struct dmemfs_mount_opts { unsigned long dpage_size; }; @@ -218,6 +259,13 @@ static unsigned long dmem_pgoff_to_index(struct inode *inode, pgoff_t pgoff) return pgoff >> (sb->s_blocksize_bits - PAGE_SHIFT); } +static pgoff_t dmem_index_to_pgoff(struct inode *inode, unsigned long index) +{ + struct super_block *sb = inode->i_sb; + + return index << (sb->s_blocksize_bits - PAGE_SHIFT); +} + static void *dmem_addr_to_entry(struct inode *inode, phys_addr_t addr) { struct super_block *sb = inode->i_sb; @@ -806,6 +854,23 @@ static void dmemfs_evict_inode(struct inode *inode) clear_inode(inode); } +static struct inode *dmemfs_alloc_inode(struct super_block *sb) +{ + struct inode *inode; + + inode = alloc_inode_nonrcu(); + if (inode) + dmemfs_create_dmemfs_inode(inode); + return inode; +} + +static void dmemfs_destroy_inode(struct inode *inode) +{ + if (inode) + dmemfs_delete_dmemfs_inode(inode); + free_inode_nonrcu(inode); +} + /* * Display the mount options in /proc/mounts. 
*/ @@ -819,9 +884,11 @@ static int dmemfs_show_options(struct seq_file *m, struct dentry *root) } static const struct super_operations dmemfs_ops = { + .alloc_inode = dmemfs_alloc_inode, .statfs = dmemfs_statfs, .evict_inode = dmemfs_evict_inode, .drop_inode = generic_delete_inode, + .destroy_inode = dmemfs_destroy_inode, .show_options = dmemfs_show_options, }; @@ -901,17 +968,91 @@ static void dmemfs_kill_sb(struct super_block *sb) .kill_sb = dmemfs_kill_sb, }; +static struct inode * +dmemfs_find_inode_by_addr(phys_addr_t addr, pgoff_t *pgoff) +{ + struct dmemfs_inode *di; + struct inode *inode; + struct address_space *mapping; + void *entry, **slot; + void *mce_entry; + + list_for_each_entry(di, &dmemfs_inode_list, link) { + inode = di->inode; + mapping = inode->i_mapping; + mce_entry = dmem_addr_to_entry(inode, addr); + XA_STATE(xas, &mapping->i_pages, 0); + rcu_read_lock(); + + xas_for_each(&xas, entry, ULONG_MAX) { + if (xas_retry(&xas, entry)) + continue; + + if (unlikely(entry != xas_reload(&xas))) + goto retry; + + if (mce_entry != entry) + continue; + *pgoff = dmem_index_to_pgoff(inode, xas.xa_index); + rcu_read_unlock(); + return inode; +retry: + xas_reset(&xas); + } + rcu_read_unlock(); + } + return NULL; +} + +static int dmemfs_mce_handler(struct notifier_block *this, unsigned long pfn, + void *v) +{ + struct dmem_mce_notifier_info *info = + (struct dmem_mce_notifier_info *)v; + int flags = info->flags; + struct inode *inode; + phys_addr_t mce_addr = __pfn_to_phys(pfn); + pgoff_t pgoff; + + spin_lock(&dmemfs_inode_lock); + inode = dmemfs_find_inode_by_addr(mce_addr, &pgoff); + if (!inode || !atomic_read(&inode->i_count)) + goto out; + + collect_procs_and_signal_inode(inode, pgoff, pfn, flags); +out: + spin_unlock(&dmemfs_inode_lock); + return 0; +} + +static struct notifier_block dmemfs_mce_notifier = { + .notifier_call = dmemfs_mce_handler, +}; + static int __init dmemfs_init(void) { int ret; + pr_info("dmemfs initialized\n"); ret = register_filesystem(&dmemfs_fs_type); + if (ret) + goto reg_fs_fail; + + ret = dmem_register_mce_notifier(&dmemfs_mce_notifier); + if (ret) + goto reg_notifier_fail; + return 0; + +reg_notifier_fail: + unregister_filesystem(&dmemfs_fs_type); +reg_fs_fail: return ret; } static void __exit dmemfs_uninit(void) { + dmem_unregister_mce_notifier(&dmemfs_mce_notifier); unregister_filesystem(&dmemfs_fs_type); } diff --git a/include/linux/dmem.h b/include/linux/dmem.h index cd17a91..fe0b270 100644 --- a/include/linux/dmem.h +++ b/include/linux/dmem.h @@ -23,6 +23,13 @@ #define dmem_free_page(addr) dmem_free_pages(addr, 1) bool dmem_memory_failure(unsigned long pfn, int flags); + +struct dmem_mce_notifier_info { + int flags; +}; + +int dmem_register_mce_notifier(struct notifier_block *nb); +int dmem_unregister_mce_notifier(struct notifier_block *nb); #else static inline int dmem_reserve_init(void) { diff --git a/include/linux/mm.h b/include/linux/mm.h index 2f3135fe..fa20f9c 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3041,6 +3041,8 @@ enum mf_flags { extern void memory_failure_queue(unsigned long pfn, int flags); extern void memory_failure_queue_kick(int cpu); extern int unpoison_memory(unsigned long pfn); +extern void collect_procs_and_signal_inode(struct inode *inode, pgoff_t pgoff, + unsigned long pfn, int flags); extern int sysctl_memory_failure_early_kill; extern int sysctl_memory_failure_recovery; extern void shake_page(struct page *p, int access); diff --git a/mm/dmem.c b/mm/dmem.c index 16438db..dd81b24 100644 --- a/mm/dmem.c +++ 
b/mm/dmem.c @@ -70,6 +70,7 @@ struct dmem_node { struct dmem_pool { struct mutex lock; + struct raw_notifier_head mce_notifier_chain; unsigned long region_num; unsigned long registered_pages; @@ -92,6 +93,7 @@ struct dmem_pool { static struct dmem_pool dmem_pool = { .lock = __MUTEX_INITIALIZER(dmem_pool.lock), + .mce_notifier_chain = RAW_NOTIFIER_INIT(dmem_pool.mce_notifier_chain), }; #define DMEM_PAGE_SIZE (1UL << dmem_pool.dpage_shift) @@ -121,6 +123,35 @@ struct dmem_pool { #define for_each_dmem_region(_dnode, _dregion) \ list_for_each_entry(_dregion, &(_dnode)->regions, node) +int dmem_register_mce_notifier(struct notifier_block *nb) +{ + int ret; + + mutex_lock(&dmem_pool.lock); + ret = raw_notifier_chain_register(&dmem_pool.mce_notifier_chain, nb); + mutex_unlock(&dmem_pool.lock); + return ret; +} +EXPORT_SYMBOL(dmem_register_mce_notifier); + +int dmem_unregister_mce_notifier(struct notifier_block *nb) +{ + int ret; + + mutex_lock(&dmem_pool.lock); + ret = raw_notifier_chain_unregister(&dmem_pool.mce_notifier_chain, nb); + mutex_unlock(&dmem_pool.lock); + return ret; +} +EXPORT_SYMBOL(dmem_unregister_mce_notifier); + +static int dmem_mce_notify(unsigned long pfn, + struct dmem_mce_notifier_info *info) +{ + return raw_notifier_call_chain(&dmem_pool.mce_notifier_chain, + pfn, info); +} + static inline int *dmem_nodelist(int nid) { return nid_to_dnode(nid)->nodelist; @@ -1003,6 +1034,7 @@ bool dmem_memory_failure(unsigned long pfn, int flags) u64 pos; phys_addr_t addr = __pfn_to_phys(pfn); bool used = false; + struct dmem_mce_notifier_info info; dregion = find_dmem_region(addr, &pdnode); if (!dregion) @@ -1022,6 +1054,8 @@ bool dmem_memory_failure(unsigned long pfn, int flags) pos = phys_to_dpage(addr) - dregion->dpage_start_pfn; if (__test_and_set_bit(pos, dregion->bitmap)) { used = true; + info.flags = flags; + dmem_mce_notify(pfn, &info); } else { pr_info("MCE: free dpage, mark %#lx disabled in dmem\n", pfn); dnode_count_free_dpages(pdnode, -1); diff --git a/mm/memory-failure.c b/mm/memory-failure.c index dda45d2..3aa7fe7 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -334,8 +334,8 @@ static unsigned long dev_pagemap_mapping_shift(struct page *page, * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM. */ static void add_to_kill(struct task_struct *tsk, struct page *p, - struct vm_area_struct *vma, - struct list_head *to_kill) + struct vm_area_struct *vma, unsigned long pfn, + pgoff_t pgoff, struct list_head *to_kill) { struct to_kill *tk; @@ -345,12 +345,17 @@ static void add_to_kill(struct task_struct *tsk, struct page *p, return; } - tk->addr = page_address_in_vma(p, vma); - if (is_zone_device_page(p)) - tk->size_shift = dev_pagemap_mapping_shift(p, vma); - else - tk->size_shift = page_shift(compound_head(p)); - + if (p) { + tk->addr = page_address_in_vma(p, vma); + if (is_zone_device_page(p)) + tk->size_shift = dev_pagemap_mapping_shift(p, vma); + else + tk->size_shift = page_shift(compound_head(p)); + } else { + tk->size_shift = PAGE_SHIFT; + tk->addr = vma->vm_start + + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT); + } /* * Send SIGKILL if "tk->addr == -EFAULT". 
Also, as * "tk->size_shift" is always non-zero for !is_zone_device_page(), @@ -363,7 +368,7 @@ static void add_to_kill(struct task_struct *tsk, struct page *p, */ if (tk->addr == -EFAULT) { pr_info("Memory failure: Unable to find user space address %lx in %s\n", - page_to_pfn(p), tsk->comm); + pfn, tsk->comm); } else if (tk->size_shift == 0) { kfree(tk); return; @@ -496,7 +501,8 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill, if (!page_mapped_in_vma(page, vma)) continue; if (vma->vm_mm == t->mm) - add_to_kill(t, page, vma, to_kill); + add_to_kill(t, page, vma, page_to_pfn(page), + page_to_pgoff(page), to_kill); } } read_unlock(&tasklist_lock); @@ -504,19 +510,17 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill, } /* - * Collect processes when the error hit a file mapped page. + * Collect processes when the error hit a file mapped memory. */ -static void collect_procs_file(struct page *page, struct list_head *to_kill, - int force_early) +static void __collect_procs_file(struct address_space *mapping, pgoff_t pgoff, + struct page *page, unsigned long pfn, + struct list_head *to_kill, int force_early) { struct vm_area_struct *vma; struct task_struct *tsk; - struct address_space *mapping = page->mapping; - pgoff_t pgoff; i_mmap_lock_read(mapping); read_lock(&tasklist_lock); - pgoff = page_to_pgoff(page); for_each_process(tsk) { struct task_struct *t = task_early_kill(tsk, force_early); @@ -532,7 +536,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill, * to be informed of all such data corruptions. */ if (vma->vm_mm == t->mm) - add_to_kill(t, page, vma, to_kill); + add_to_kill(t, page, vma, pfn, pgoff, to_kill); } } read_unlock(&tasklist_lock); @@ -540,6 +544,32 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill, } /* + * Collect processes when the error hit a file mapped page. + */ +static void collect_procs_file(struct page *page, struct list_head *to_kill, + int force_early) +{ + struct address_space *mapping = page->mapping; + + __collect_procs_file(mapping, page_to_pgoff(page), page, + page_to_pfn(page), to_kill, force_early); +} + +void collect_procs_and_signal_inode(struct inode *inode, pgoff_t pgoff, + unsigned long pfn, int flags) +{ + int forcekill; + struct address_space *mapping = &inode->i_data; + LIST_HEAD(tokill); + + __collect_procs_file(mapping, pgoff, NULL, pfn, &tokill, + flags & MF_ACTION_REQUIRED); + forcekill = flags & MF_MUST_KILL; + kill_procs(&tokill, forcekill, false, pfn, flags); +} +EXPORT_SYMBOL(collect_procs_and_signal_inode); + +/* * Collect the processes who have the corrupted page mapped to kill. 
*/ static void collect_procs(struct page *page, struct list_head *tokill, From patchwork Mon Dec 7 11:31:26 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11955485 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-15.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 12088C433FE for ; Mon, 7 Dec 2020 11:37:01 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id D5F57233A0 for ; Mon, 7 Dec 2020 11:37:00 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727647AbgLGLgO (ORCPT ); Mon, 7 Dec 2020 06:36:14 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41132 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727496AbgLGLgM (ORCPT ); Mon, 7 Dec 2020 06:36:12 -0500 Received: from mail-pj1-x1043.google.com (mail-pj1-x1043.google.com [IPv6:2607:f8b0:4864:20::1043]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 2B869C061A51; Mon, 7 Dec 2020 03:35:32 -0800 (PST) Received: by mail-pj1-x1043.google.com with SMTP id iq13so7183740pjb.3; Mon, 07 Dec 2020 03:35:32 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=ve50LBxe7KPsa/afhSPdadx4mGbx0d5NNuSu5BfW+ws=; b=iGPfYHDT9qrLj7NFYNDa99CfU/LeTtf1xtrQ4M7/y/WnH1dn41NqjJOU3FG2cI2rI1 h0kfX9f8shUAZZMOJgGaejDmBdM+kopryJtCrKgU0WydjatlKrXuJdW2bF7peb3YE73w HuxE6iS0CuXe4q7GR4g4kSkSH6FnTepRTIMRjxotGxhIeEy3ukOJoI5oZIXRy7bWfhgi IvKePLSYBnWHUrmlhAUkfkx96TSLs4I/L6WgpPAkzxznqxqFcO+gGTdsC/FaANWeyvGZ qVOLpBLEzfoJSjxHnOcThPOkYd3adliwtBA5VZ8npQ1fRXKU/uPKqTcL3sx4i7jwqU37 jGTg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=ve50LBxe7KPsa/afhSPdadx4mGbx0d5NNuSu5BfW+ws=; b=cu5oQ2xVkvv8YjCyvP/SECkAXF9FTuwciqkIffybDTP7CgTmT21VdALgVFrLa3L1Y9 UyCsmwIP+3eUi6Kbprk/6JM63sl7EFNzi4PgiktwBGK4NMa4sMauea/u8cFz9i5ciGKi EkKzSmya2jVX0k0g3HvxI7J+op17lHb2xG8I5wBvSgqxF3Bbyw9B6kzvmAN8rtOjnepZ cprWdf/dXskV2vo5oCfcH6gcJTG1K7DBFHPff8k/CyTCaEeY8TxWbr5z0ZnECGPyI/ie nrr0NqTV1q+7yxhOrWlYYWU7cFfoBsmAnaIHDUiOF24HJCOidOsvSGDzkXHUnFa1OtMi fPlQ== X-Gm-Message-State: AOAM532+lmkxrzAJ2ckvxqTPoMJz4PDIsTPycr1DPMnwHubielLibSqc LlcuUBZODobbQbMZw1pte7WtD0IsvP4= X-Google-Smtp-Source: ABdhPJw+yEq36n4uncdVGAdYzZ2qqW7UOAZi3abFVTKDELRoe1bqIX3q7doUgLIBaZmdlgELZCPS2A== X-Received: by 2002:a17:90b:1945:: with SMTP id nk5mr15957725pjb.30.1607340931753; Mon, 07 Dec 2020 03:35:31 -0800 (PST) Received: from localhost.localdomain ([203.205.141.39]) by smtp.gmail.com with ESMTPSA id d4sm14219822pfo.127.2020.12.07.03.35.28 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Mon, 07 Dec 2020 03:35:31 -0800 (PST) From: yulei.kernel@gmail.com X-Google-Original-From: yuleixzhang@tencent.com 
To: linux-mm@kvack.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com Cc: joao.m.martins@oracle.com, rdunlap@infradead.org, sean.j.christopherson@intel.com, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang Subject: [RFC V2 33/37] kvm, x86: enable record_steal_time for dmem Date: Mon, 7 Dec 2020 19:31:26 +0800 Message-Id: X-Mailer: git-send-email 2.28.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org From: Yulei Zhang Adjust the kvm_map_gfn while using dmemfs to enable record_steal_time when entering the guest. Signed-off-by: Yulei Zhang --- virt/kvm/kvm_main.c | 13 +++++++++---- 1 file changed, 9 insertions(+), 4 deletions(-) diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 2541a17..500b170 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -51,6 +51,7 @@ #include #include #include +#include #include #include @@ -2164,7 +2165,10 @@ static int __kvm_map_gfn(struct kvm_memslots *slots, gfn_t gfn, hva = kmap(page); #ifdef CONFIG_HAS_IOMEM } else if (!atomic) { - hva = memremap(pfn_to_hpa(pfn), PAGE_SIZE, MEMREMAP_WB); + if (is_dmem_pfn(pfn)) + hva = __va(PFN_PHYS(pfn)); + else + hva = memremap(pfn_to_hpa(pfn), PAGE_SIZE, MEMREMAP_WB); } else { return -EINVAL; #endif @@ -2214,9 +2218,10 @@ static void __kvm_unmap_gfn(struct kvm_memory_slot *memslot, kunmap(map->page); } #ifdef CONFIG_HAS_IOMEM - else if (!atomic) - memunmap(map->hva); - else + else if (!atomic) { + if (!is_dmem_pfn(map->pfn)) + memunmap(map->hva); + } else WARN_ONCE(1, "Unexpected unmapping in atomic context"); #endif From patchwork Mon Dec 7 11:31:27 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11955483 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-15.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id DB6F8C2BBD4 for ; Mon, 7 Dec 2020 11:36:26 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id B4A92233A0 for ; Mon, 7 Dec 2020 11:36:26 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727670AbgLGLgR (ORCPT ); Mon, 7 Dec 2020 06:36:17 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:41146 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727648AbgLGLgQ (ORCPT ); Mon, 7 Dec 2020 06:36:16 -0500 Received: from mail-pg1-x544.google.com (mail-pg1-x544.google.com [IPv6:2607:f8b0:4864:20::544]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id BFF4BC061A56; Mon, 7 Dec 2020 03:35:35 -0800 (PST) Received: by mail-pg1-x544.google.com with SMTP id w16so8642093pga.9; Mon, 07 Dec 2020 03:35:35 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references 
:mime-version:content-transfer-encoding; bh=2oI/oh3gTLcVKVrjU4mtMVnwaLDwsHWGDZd/m897JZk=; b=lhjKgDf+8F/uFxn5RBaOSzYX0LvT73tAN0Gg+Z8ChDCSn1g21h9K7z03ULsA5x3FO/ 2a+Eg0te0MGsaduIR3o9k0QLaF/mI8tGtephTk2VDC0HtpswcYPbfNlwIcireWiJ3lEi YBLTiakDlLO8KbJjbiP2fFHdnJI2rSybJZ1rp2A8mhn/yuJGj2RdiE8Y7GUN+76U1u3F grW+T/JmrPwOi4a+Ia/MdYqiJ3odiRzcVjQEiwxt6frfkjpO1sZ3bLJV1WGZr7av8hnV KBYDpv1nu/d9yh0K3QIlXoA278BPZLPtVPY9jfVMUeGQEixWS3nOF9qEgg52Qo3REX5+ jePQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=2oI/oh3gTLcVKVrjU4mtMVnwaLDwsHWGDZd/m897JZk=; b=S3+HlXwLU+dh2EF7RJsZ9ZeYh6bv24koic/3vXshwYR17eNG0dEBVynz1K44JX0B4n fjs4v1oDWkLmUOpVRuOKdSi7/pe7Xs45Vx14LDG7bZhpGzHD9nyM+AXIIwMNgV1OVcR8 E/ntPQd7U2RnoeFdiUX9uc9I+7QsU4WOUv5QPF0G0K5/Ra9Y5pLQv33hJbtR+InlvYXo 6Oi7ho2jqb0YQdCdW0cV2cXPTeDO+BAoBnba0nR5yEgR73y1aQB3WmB77hS4SMezG1Q+ GLIK92Yn0yHsiggln6HkQ4Mn4soaHbDOXHvlhdmx+ddOHg+4xZOfxoYtMinvuwXimfh1 3rwg== X-Gm-Message-State: AOAM530jJdeGiTWNBjsYs7FBXZeCDzz44n6lzLw79W4601ZujfrXpLO6 or3hy7XoT8TSnztNQOAy4iM= X-Google-Smtp-Source: ABdhPJxk408hICw1VHxdPSSZFmVYiF7oejg12uAdXoM+nWbeUsNih67nBm2DaYDp/HkVlqXXpfd2hQ== X-Received: by 2002:a17:902:fe17:b029:da:799a:8bfd with SMTP id g23-20020a170902fe17b02900da799a8bfdmr15667660plj.10.1607340935397; Mon, 07 Dec 2020 03:35:35 -0800 (PST) Received: from localhost.localdomain ([203.205.141.39]) by smtp.gmail.com with ESMTPSA id d4sm14219822pfo.127.2020.12.07.03.35.32 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Mon, 07 Dec 2020 03:35:34 -0800 (PST) From: yulei.kernel@gmail.com X-Google-Original-From: yuleixzhang@tencent.com To: linux-mm@kvack.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com Cc: joao.m.martins@oracle.com, rdunlap@infradead.org, sean.j.christopherson@intel.com, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang , Xiao Guangrong Subject: [RFC V2 34/37] dmem: add dmem unit tests Date: Mon, 7 Dec 2020 19:31:27 +0800 Message-Id: X-Mailer: git-send-email 2.28.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org From: Yulei Zhang This test case is used to test dmem management system. 
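As a usage note (assuming a tree with the preceding dmem patches applied): the Kbuild added by this patch builds the tests as dmem-test.ko, so running the provided Makefile against the patched kernel (KDIR defaults to ../../../) and loading the module with insmod exercises order_test() -- allocations of order 0 through 6 with 4 KiB dpages -- and then node_test(), which switches to PUD-sized dpages, exhausts dmem on node 0 and verifies the nodemask fallback; pass/fail is reported through pr_info() in the kernel log, and the module can be removed with rmmod afterwards.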
Signed-off-by: Xiao Guangrong Signed-off-by: Yulei Zhang --- tools/testing/dmem/Kbuild | 1 + tools/testing/dmem/Makefile | 10 +++ tools/testing/dmem/dmem-test.c | 184 +++++++++++++++++++++++++++++++++++++++++ 3 files changed, 195 insertions(+) create mode 100644 tools/testing/dmem/Kbuild create mode 100644 tools/testing/dmem/Makefile create mode 100644 tools/testing/dmem/dmem-test.c diff --git a/tools/testing/dmem/Kbuild b/tools/testing/dmem/Kbuild new file mode 100644 index 00000000..04988f7 --- /dev/null +++ b/tools/testing/dmem/Kbuild @@ -0,0 +1 @@ +obj-m += dmem-test.o diff --git a/tools/testing/dmem/Makefile b/tools/testing/dmem/Makefile new file mode 100644 index 00000000..21f141f --- /dev/null +++ b/tools/testing/dmem/Makefile @@ -0,0 +1,10 @@ +KDIR ?= ../../../ + +default: + $(MAKE) -C $(KDIR) M=$$PWD + +install: default + $(MAKE) -C $(KDIR) M=$$PWD modules_install + +clean: + rm -f *.o *.ko Module.* modules.* *.mod.c diff --git a/tools/testing/dmem/dmem-test.c b/tools/testing/dmem/dmem-test.c new file mode 100644 index 00000000..4baae18 --- /dev/null +++ b/tools/testing/dmem/dmem-test.c @@ -0,0 +1,184 @@ +/* + * This program is free software; you can redistribute it and/or modify + * it under the terms of version 2 of the GNU General Public License as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + */ +#include +#include +#include +#include +#include +#include +#include +#include + +struct dmem_mem_node { + struct list_head node; +}; + +static LIST_HEAD(dmem_list); + +static int dmem_test_alloc_init(unsigned long dpage_shift) +{ + int ret; + + ret = dmem_alloc_init(dpage_shift); + if (ret) + pr_info("dmem_alloc_init failed, dpage_shift %ld ret=%d\n", + dpage_shift, ret); + return ret; +} + +static int __dmem_test_alloc(int order, int nid, nodemask_t *nodemask, + const char *caller) +{ + struct dmem_mem_node *pos; + phys_addr_t addr; + int i, ret = 0; + + for (i = 0; i < (1 << order); i++) { + addr = dmem_alloc_pages_nodemask(nid, nodemask, 1, NULL); + if (!addr) { + ret = -ENOMEM; + break; + } + + pos = __va(addr); + list_add(&pos->node, &dmem_list); + } + + pr_info("%s: alloc order %d on node %d has fallback node %s... %s.\n", + caller, order, nid, nodemask ? "yes" : "no", + !ret ? 
"okay" : "failed"); + + return ret; +} + +static void dmem_test_free_all(void) +{ + struct dmem_mem_node *pos, *n; + + list_for_each_entry_safe(pos, n, &dmem_list, node) { + list_del(&pos->node); + dmem_free_page(__pa(pos)); + } +} + +#define dmem_test_alloc(order, nid, nodemask) \ + __dmem_test_alloc(order, nid, nodemask, __func__) + +/* dmem shoud have 2^6 native pages available at lest */ +static int order_test(void) +{ + int order, i, ret; + int page_orders[] = {0, 1, 2, 3, 4, 5, 6}; + + ret = dmem_test_alloc_init(PAGE_SHIFT); + if (ret) + return ret; + + for (i = 0; i < ARRAY_SIZE(page_orders); i++) { + order = page_orders[i]; + + ret = dmem_test_alloc(order, numa_node_id(), NULL); + if (ret) + break; + } + + dmem_test_free_all(); + + dmem_alloc_uinit(); + + return ret; +} + +static int node_test(void) +{ + nodemask_t nodemask; + unsigned long nr = 0; + int order; + int node; + int ret = 0; + + order = 0; + + ret = dmem_test_alloc_init(PUD_SHIFT); + if (ret) + return ret; + + pr_info("%s: test allocation on node 0\n", __func__); + node = 0; + nodes_clear(nodemask); + node_set(0, nodemask); + + ret = dmem_test_alloc(order, node, &nodemask); + if (ret) + goto exit; + + dmem_test_free_all(); + + pr_info("%s: begin to exhaust dmem on node 0.\n", __func__); + node = 1; + nodes_clear(nodemask); + node_set(0, nodemask); + + INIT_LIST_HEAD(&dmem_list); + while (!(ret = dmem_test_alloc(order, node, &nodemask))) + nr++; + + pr_info("Allocation on node 0 success times: %lu\n", nr); + + pr_info("%s: allocation on node 0 again\n", __func__); + node = 0; + nodes_clear(nodemask); + node_set(0, nodemask); + ret = dmem_test_alloc(order, node, &nodemask); + if (!ret) { + pr_info("\tNot expected fallback\n"); + ret = -1; + } else { + ret = 0; + pr_info("\tOK, Dmem on node 0 exhausted, fallback success\n"); + } + + pr_info("%s: Release dmem\n", __func__); + dmem_test_free_all(); + +exit: + dmem_alloc_uinit(); + return ret; +} + +static __init int dmem_test_init(void) +{ + int ret; + + pr_info("dmem: test init...\n"); + + ret = order_test(); + if (ret) + return ret; + + ret = node_test(); + + + if (ret) + pr_info("dmem test fail, ret=%d\n", ret); + else + pr_info("dmem test success\n"); + return ret; +} + +static __exit void dmem_test_exit(void) +{ + pr_info("dmem: test exit...\n"); +} + +module_init(dmem_test_init); +module_exit(dmem_test_exit); +MODULE_LICENSE("GPL v2"); From patchwork Mon Dec 7 11:31:28 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11955479 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-15.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 03412C4361B for ; Mon, 7 Dec 2020 11:36:25 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id AA930233A1 for ; Mon, 7 Dec 2020 11:36:24 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727612AbgLGLfz (ORCPT ); Mon, 7 Dec 2020 06:35:55 -0500 Received: from lindbergh.monkeyblade.net 
([23.128.96.19]:41012 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727566AbgLGLfy (ORCPT ); Mon, 7 Dec 2020 06:35:54 -0500 Received: from mail-pf1-x442.google.com (mail-pf1-x442.google.com [IPv6:2607:f8b0:4864:20::442]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 0E04AC0613D1; Mon, 7 Dec 2020 03:35:39 -0800 (PST) Received: by mail-pf1-x442.google.com with SMTP id t8so9610762pfg.8; Mon, 07 Dec 2020 03:35:39 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :mime-version:content-transfer-encoding; bh=ZbMU89dV/uL3+8OuWgp1O7gQ5pASlyw5FoQtpms6guk=; b=hJ834cKhEjJVNZZ3t967QVJ2E8W9K6mc6jhd3ZkKT1XWsW3Git/iP8YVfRmQyt8sIr CjbTX37cTnlCLLuno8hPGkGDUmoZ1BtiEYJXX1b6z90vTqSR1KA8+8rId4N/7TnizniZ enCmfy4akOOpsgOGcbpB931EuqRfgpj+h5gaDLkFjyYbBZ9e/Dda80jgLj0M5Cqc6r1u hwxdv353KdyhkQxcMeQ8uHs9JbJAVgoln+7QovXo3A8RUP62nOLFJKHy2/oiSVfloI6E 22pFGJvSpZrxXWrD0eCtE1IJmN+S1qu2/O18aqpz1ZdL8sqSQQUdbZk/09SaW4dU8ASj 9dEg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:mime-version:content-transfer-encoding; bh=ZbMU89dV/uL3+8OuWgp1O7gQ5pASlyw5FoQtpms6guk=; b=TuQKFpf29+zpYdMBBWHrMu3N8QBrwJjB42Hc1f6hP+EcGHRS5ZMmvda/c/wW4Fs6Kh fx+gRJmtD63gFrOFpaDzx7uB3S80/qkdPH5Fl4CZOKFfQEW+a7fMrWqyXdp3cKHNX2x2 acjpQwOVa3ZpU81iiMlCMBKGE2/Lf4SCAmIsaoVd4X4f7Vey+5SvvZcUQMh9IITFSuGc i1Acew3F0hHpltS8yCVe5tKl8YlzZbLidunmDjn807Sc3gxihmHRZp6Dsg1cqbmoWjhg ZZVTKaoo/1ekRGFn1SDSP+TjXn6RPlyGVKRzqhILhox3qQgFurVKYtrSnr6/LP61EOVX WZjw== X-Gm-Message-State: AOAM530QiFsVAvww30mbEOke0nagdq8yoKsRlJAKsJsb7GeNe+Gt7tp6 QWypeEja9Zr+5wyX6qoTq/drpNPkUvU= X-Google-Smtp-Source: ABdhPJytxl0Aj3b5REF0I4awOHOqfBomUWPVwgLlHgKjMQw2BuQ5IeQCZdBfDO1sTV2IIFA1EZ86zQ== X-Received: by 2002:a63:4905:: with SMTP id w5mr17945642pga.124.1607340938617; Mon, 07 Dec 2020 03:35:38 -0800 (PST) Received: from localhost.localdomain ([203.205.141.39]) by smtp.gmail.com with ESMTPSA id d4sm14219822pfo.127.2020.12.07.03.35.35 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Mon, 07 Dec 2020 03:35:38 -0800 (PST) From: yulei.kernel@gmail.com X-Google-Original-From: yuleixzhang@tencent.com To: linux-mm@kvack.org, akpm@linux-foundation.org, linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com Cc: joao.m.martins@oracle.com, rdunlap@infradead.org, sean.j.christopherson@intel.com, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang , Chen Zhuo Subject: [RFC V2 35/37] mm, dmem: introduce dregion->memmap for dmem Date: Mon, 7 Dec 2020 19:31:28 +0800 Message-Id: X-Mailer: git-send-email 2.28.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org From: Yulei Zhang Append 'memmap' into struct dmem_region, mapping each page of dmem with struct dmempage. Currently there is just one member '_refcount' in struct dmempage to reflect the number of all modules which occupied the dmem page. Modules which allocates the dmem page from dmempool will make first reference and set _refcount to 1. Modules which try to free the dmem page to dmempool will decrease 1 at _refcount and free it if _refcount is tested as zero after decrease. 
Each time module A passes a dmem page to module B, module B should call
get_dmem_pfn() to raise _refcount before using the page, so that it does
not reference a dmem page that is freed by another module in parallel.
Conversely, once module B has finished with the dmem page, it must call
put_dmem_pfn() to drop the _refcount again. (An illustrative sketch of
this get/put protocol follows the diff below.)

Signed-off-by: Chen Zhuo 
Signed-off-by: Yulei Zhang 
---
 include/linux/dmem.h |   5 ++
 mm/dmem.c            | 147 ++++++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 139 insertions(+), 13 deletions(-)

diff --git a/include/linux/dmem.h b/include/linux/dmem.h
index fe0b270..8aaa80b 100644
--- a/include/linux/dmem.h
+++ b/include/linux/dmem.h
@@ -22,6 +22,9 @@ bool is_dmem_pfn(unsigned long pfn);
 
 #define dmem_free_page(addr)	dmem_free_pages(addr, 1)
 
+void get_dmem_pfn(unsigned long pfn);
+#define put_dmem_pfn(pfn)	dmem_free_page(PFN_PHYS(pfn))
+
 bool dmem_memory_failure(unsigned long pfn, int flags);
 
 struct dmem_mce_notifier_info {
@@ -45,5 +48,7 @@ static inline bool dmem_memory_failure(unsigned long pfn, int flags)
 {
 	return false;
 }
+static inline void get_dmem_pfn(unsigned long pfn) {}
+static inline void put_dmem_pfn(unsigned long pfn) {}
 #endif
 #endif /* _LINUX_DMEM_H */
diff --git a/mm/dmem.c b/mm/dmem.c
index dd81b24..776dbf2 100644
--- a/mm/dmem.c
+++ b/mm/dmem.c
@@ -47,6 +47,7 @@ struct dmem_region {
 
 	unsigned long static_error_bitmap;
 	unsigned long *error_bitmap;
+	void *memmap;
 };
 
 /*
@@ -91,6 +92,10 @@ struct dmem_pool {
 	struct dmem_node nodes[MAX_NUMNODES];
 };
 
+struct dmempage {
+	atomic_t _refcount;
+};
+
 static struct dmem_pool dmem_pool = {
 	.lock = __MUTEX_INITIALIZER(dmem_pool.lock),
 	.mce_notifier_chain = RAW_NOTIFIER_INIT(dmem_pool.mce_notifier_chain),
@@ -123,6 +128,40 @@ struct dmem_pool {
 #define for_each_dmem_region(_dnode, _dregion)	\
 	list_for_each_entry(_dregion, &(_dnode)->regions, node)
 
+#define pfn_to_dmempage(_pfn, _dregion)				\
+	((struct dmempage *)(_dregion)->memmap +		\
+	pfn_to_dpage(_pfn) - (_dregion)->dpage_start_pfn)
+
+#define dmempage_to_dpage(_dmempage, _dregion)			\
+	((_dmempage) - (struct dmempage *)(_dregion)->memmap +	\
+	(_dregion)->dpage_start_pfn)
+
+static inline int dmempage_count(struct dmempage *dmempage)
+{
+	return atomic_read(&dmempage->_refcount);
+}
+
+static inline void set_dmempage_count(struct dmempage *dmempage, int v)
+{
+	atomic_set(&dmempage->_refcount, v);
+}
+
+static inline void dmempage_ref_inc(struct dmempage *dmempage)
+{
+	atomic_inc(&dmempage->_refcount);
+}
+
+static inline int dmempage_ref_dec_and_test(struct dmempage *dmempage)
+{
+	return atomic_dec_and_test(&dmempage->_refcount);
+}
+
+static inline int put_dmempage_testzero(struct dmempage *dmempage)
+{
+	VM_BUG_ON(dmempage_count(dmempage) == 0);
+	return dmempage_ref_dec_and_test(dmempage);
+}
+
 int dmem_register_mce_notifier(struct notifier_block *nb)
 {
 	int ret;
@@ -559,10 +598,25 @@ static int __init dmem_late_init(void)
 }
 late_initcall(dmem_late_init);
 
+static void *dmem_memmap_alloc(unsigned long dpages)
+{
+	unsigned long size;
+
+	size = dpages * sizeof(struct dmempage);
+	return vzalloc(size);
+}
+
+static void dmem_memmap_free(void *memmap)
+{
+	if (memmap)
+		vfree(memmap);
+}
+
 static int dmem_alloc_region_init(struct dmem_region *dregion,
 				  unsigned long *dpages)
 {
 	unsigned long start, end, *bitmap;
+	void *memmap;
 
 	start = DMEM_PAGE_UP(dregion->reserved_start_addr);
 	end = DMEM_PAGE_DOWN(dregion->reserved_end_addr);
@@ -575,7 +629,14 @@ static int dmem_alloc_region_init(struct dmem_region *dregion,
 	if (!bitmap)
 		return -ENOMEM;
 
+	memmap = dmem_memmap_alloc(*dpages);
+	if (!memmap) {
+		dmem_bitmap_free(*dpages, bitmap, &dregion->static_bitmap);
+		return -ENOMEM;
+	}
+
 	dregion->bitmap = bitmap;
+	dregion->memmap = memmap;
 	dregion->next_free_pos = 0;
 	dregion->dpage_start_pfn = start;
 	dregion->dpage_end_pfn = end;
@@ -650,7 +711,9 @@ static void dmem_alloc_region_uinit(struct dmem_region *dregion)
 	dmem_uinit_check_alloc_bitmap(dregion);
 
 	dmem_bitmap_free(dpages, bitmap, &dregion->static_bitmap);
+	dmem_memmap_free(dregion->memmap);
 	dregion->bitmap = NULL;
+	dregion->memmap = NULL;
 }
 
 static void __dmem_alloc_uinit(void)
@@ -793,6 +856,16 @@ int dmem_alloc_init(unsigned long dpage_shift)
 	return dpage_to_phys(dregion->dpage_start_pfn + pos);
 }
 
+static void prep_new_dmempage(unsigned long phys, unsigned int nr,
+			      struct dmem_region *dregion)
+{
+	struct dmempage *dmempage = pfn_to_dmempage(PHYS_PFN(phys), dregion);
+	unsigned int i;
+
+	for (i = 0; i < nr; i++, dmempage++)
+		set_dmempage_count(dmempage, 1);
+}
+
 /*
  * allocate dmem pages from the nodelist
  *
@@ -839,6 +912,7 @@ int dmem_alloc_init(unsigned long dpage_shift)
 		if (addr) {
 			dnode_count_free_dpages(dnode,
 						-(long)(*result_nr));
+			prep_new_dmempage(addr, *result_nr, dregion);
 			break;
 		}
 	}
@@ -993,6 +1067,41 @@ static struct dmem_region *find_dmem_region(phys_addr_t phys_addr,
 	return NULL;
 }
 
+static unsigned int free_dmempages_prepare(struct dmempage *dmempage,
+					   unsigned int dpages_nr)
+{
+	unsigned int i, ret = 0;
+
+	for (i = 0; i < dpages_nr; i++, dmempage++)
+		if (put_dmempage_testzero(dmempage))
+			ret++;
+
+	return ret;
+}
+
+void __dmem_free_pages(struct dmempage *dmempage,
+		       unsigned int dpages_nr,
+		       struct dmem_region *dregion,
+		       struct dmem_node *pdnode)
+{
+	phys_addr_t dpage = dmempage_to_dpage(dmempage, dregion);
+	u64 pos;
+	unsigned long err_dpages;
+
+	trace_dmem_free_pages(dpage_to_phys(dpage), dpages_nr);
+	WARN_ON(!dmem_pool.dpage_shift);
+
+	pos = dpage - dregion->dpage_start_pfn;
+	dregion->next_free_pos = min(dregion->next_free_pos, pos);
+
+	/* it is not possible to span multiple regions */
+	WARN_ON(dpage + dpages_nr - 1 >= dregion->dpage_end_pfn);
+
+	err_dpages = dmem_alloc_bitmap_clear(dregion, dpage, dpages_nr);
+
+	dnode_count_free_dpages(pdnode, dpages_nr - err_dpages);
+}
+
 /*
  * free dmem page to the dmem pool
  * @addr: the physical address to be freed
  *
@@ -1002,27 +1111,26 @@ void dmem_free_pages(phys_addr_t addr, unsigned int dpages_nr)
 {
 	struct dmem_region *dregion;
 	struct dmem_node *pdnode = NULL;
-	phys_addr_t dpage = phys_to_dpage(addr);
-	u64 pos;
-	unsigned long err_dpages;
+	struct dmempage *dmempage;
+	unsigned int nr;
 
 	mutex_lock(&dmem_pool.lock);
 
-	trace_dmem_free_pages(addr, dpages_nr);
-	WARN_ON(!dmem_pool.dpage_shift);
-
 	dregion = find_dmem_region(addr, &pdnode);
 	WARN_ON(!dregion || !dregion->bitmap || !pdnode);
 
-	pos = dpage - dregion->dpage_start_pfn;
-	dregion->next_free_pos = min(dregion->next_free_pos, pos);
-
-	/* it is not possible to span multiple regions */
-	WARN_ON(dpage + dpages_nr - 1 >= dregion->dpage_end_pfn);
+	dmempage = pfn_to_dmempage(PHYS_PFN(addr), dregion);
 
-	err_dpages = dmem_alloc_bitmap_clear(dregion, dpage, dpages_nr);
+	nr = free_dmempages_prepare(dmempage, dpages_nr);
+	if (nr == dpages_nr)
+		__dmem_free_pages(dmempage, dpages_nr, dregion, pdnode);
+	else if (nr)
+		while (dpages_nr--, dmempage++) {
+			if (dmempage_count(dmempage))
+				continue;
+			__dmem_free_pages(dmempage, 1, dregion, pdnode);
+		}
 
-	dnode_count_free_dpages(pdnode, dpages_nr - err_dpages);
 	mutex_unlock(&dmem_pool.lock);
 }
 EXPORT_SYMBOL(dmem_free_pages);
 
@@ -1073,3 +1181,16 @@ bool is_dmem_pfn(unsigned long pfn)
 	return !!find_dmem_region(__pfn_to_phys(pfn), &dnode);
 }
 EXPORT_SYMBOL(is_dmem_pfn);
+
+void get_dmem_pfn(unsigned long pfn)
+{
+	struct dmem_region *dregion = find_dmem_region(PFN_PHYS(pfn), NULL);
+	struct dmempage *dmempage;
+
+	VM_BUG_ON(!dregion || !dregion->memmap);
+
+	dmempage = pfn_to_dmempage(pfn, dregion);
+	VM_BUG_ON(dmempage_count(dmempage) + 127u <= 127u);
+	dmempage_ref_inc(dmempage);
+}
+EXPORT_SYMBOL(get_dmem_pfn);
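As a usage sketch of the get/put protocol described in the changelog above
(illustrative only, not part of the series): the producer is assumed to have
already obtained a dmem page from the dmem allocator added earlier in the
series, so only is_dmem_pfn(), get_dmem_pfn(), put_dmem_pfn() and PHYS_PFN()
below come from the patches themselves; the two helper functions and their
names are hypothetical. Note also that, since struct dmempage holds a single
atomic_t, the memmap costs roughly 4 bytes per dpage, i.e. about 1 MiB per
GiB of dmem at a 4 KiB dpage size and only a few KiB per GiB at 2 MiB
(assuming a 4-byte atomic_t).

/*
 * Illustrative sketch: two modules sharing one dmem page under the
 * _refcount protocol.  Not part of the posted series.
 */
#include <linux/dmem.h>
#include <linux/pfn.h>
#include <linux/types.h>

/* Consumer: takes its own reference before touching the page. */
static void consumer_use_dmem_pfn(unsigned long pfn)
{
	if (!is_dmem_pfn(pfn))
		return;

	get_dmem_pfn(pfn);	/* pin against a concurrent free */

	/* ... map and access the dmem page here ... */

	put_dmem_pfn(pfn);	/* expands to dmem_free_page(PFN_PHYS(pfn)) */
}

/* Producer: owns the initial reference set by the allocator (_refcount == 1). */
static void producer_hand_off(phys_addr_t addr)
{
	unsigned long pfn = PHYS_PFN(addr);

	consumer_use_dmem_pfn(pfn);

	/* Dropping the last reference returns the page to the dmem pool. */
	put_dmem_pfn(pfn);
}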
From patchwork Mon Dec 7 11:31:29 2020
X-Patchwork-Submitter: yulei zhang
X-Patchwork-Id: 11955489
From: yulei.kernel@gmail.com
X-Google-Original-From: yuleixzhang@tencent.com
Subject: [RFC V2 36/37] vfio: support dmempage refcount for vfio
Date: Mon, 7 Dec 2020 19:31:29 +0800
Message-Id: <0e5dd1479a55d8af7adfe44390f8e45186295dce.1607332046.git.yuleixzhang@tencent.com>

From: Yulei Zhang

Call get_dmem_pfn()/put_dmem_pfn() each time the vfio module references
or releases dmem pages.

Signed-off-by: Chen Zhuo 
Signed-off-by: Yulei Zhang 
---
 drivers/vfio/vfio_iommu_type1.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index c465d1a..4856a89 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -39,6 +39,7 @@
 #include
 #include
 #include
+#include <linux/dmem.h>
 
 #define DRIVER_VERSION  "0.2"
 #define DRIVER_AUTHOR   "Alex Williamson "
@@ -411,7 +412,10 @@ static int put_pfn(unsigned long pfn, int prot)
 		unpin_user_pages_dirty_lock(&page, 1, prot & IOMMU_WRITE);
 		return 1;
-	}
+	} else if (is_dmem_pfn(pfn))
+		put_dmem_pfn(pfn);
+
+	/* Dmem page is not counted against user. */
 	return 0;
 }
 
@@ -477,6 +481,9 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
 		if (!ret && !is_invalid_reserved_pfn(*pfn))
 			ret = -EFAULT;
+
+		if (!ret && is_dmem_pfn(*pfn))
+			get_dmem_pfn(*pfn);
 	}
 done:
 	mmap_read_unlock(mm);
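For context, a condensed sketch of the pin/unpin pairing this patch
establishes, shown outside of vfio since vaddr_get_pfn() and put_pfn() are
static to vfio_iommu_type1.c. The helper names are hypothetical and the code
is not part of the series; it only restates when the dmem reference is taken
and dropped.

#include <linux/dmem.h>

/*
 * Hypothetical helpers mirroring the vfio change above: take the dmem
 * reference as soon as a user pfn is resolved, drop it when the mapping
 * is torn down.  Dmem pfns are not pinned via pin_user_pages() and are
 * not counted against the user's locked-memory accounting.
 */
static void sketch_pin_dmem(unsigned long pfn)
{
	if (is_dmem_pfn(pfn))
		get_dmem_pfn(pfn);	/* keep the dmem page alive while mapped */
}

static void sketch_unpin_dmem(unsigned long pfn)
{
	if (is_dmem_pfn(pfn))
		put_dmem_pfn(pfn);	/* drop the reference taken at pin time */
}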
From patchwork Mon Dec 7 11:31:30 2020
X-Patchwork-Submitter: yulei zhang
X-Patchwork-Id: 11955491
From: yulei.kernel@gmail.com
X-Google-Original-From: yuleixzhang@tencent.com
Subject: [RFC V2 37/37] Add documentation for dmemfs
Date: Mon, 7 Dec 2020 19:31:30 +0800
Message-Id: <6a3a71f75dad1fa440677fc1bcdc170f178be1d8.1607332046.git.yuleixzhang@tencent.com>

From: Yulei Zhang

Introduce dmemfs.rst to document the basic usage of dmemfs.

Signed-off-by: Yulei Zhang 
---
 Documentation/filesystems/dmemfs.rst | 58 ++++++++++++++++++++++++++++++++++++
 Documentation/filesystems/index.rst  |  1 +
 2 files changed, 59 insertions(+)
 create mode 100644 Documentation/filesystems/dmemfs.rst

diff --git a/Documentation/filesystems/dmemfs.rst b/Documentation/filesystems/dmemfs.rst
new file mode 100644
index 00000000..f13ed0c
--- /dev/null
+++ b/Documentation/filesystems/dmemfs.rst
@@ -0,0 +1,58 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================================
+The Direct Memory Filesystem - DMEMFS
+=====================================
+
+
+.. Table of contents
+
+   - Overview
+   - Compilation
+   - Usage
+
+Overview
+========
+
+Dmemfs (Direct Memory filesystem) is a filesystem backed by device
+memory or reserved memory. This kind of memory is special as it is
+not managed by the kernel and has no 'struct page'. Therefore it can
+save extra memory on the host for other uses, especially for guest
+virtual machines.
+
+It relies on the kernel boot parameter ``dmem=`` to reserve system
+memory when the host boots up; the details are described in
+/Documentation/admin-guide/kernel-parameters.txt.
+
+Compilation
+===========
+
+The filesystem is enabled by turning on the kernel configuration
+options::
+
+    CONFIG_DMEM_FS - Direct Memory filesystem support
+    CONFIG_DMEM    - Allow reservation of memory for dmem
+
+Additionally, the following can be turned on to aid debugging::
+
+    CONFIG_DMEM_DEBUG_FS - Enable debug information for dmem
+
+Usage
+=====
+
+Dmemfs supports mapping pages of ``4K``, ``2M`` and ``1G`` size to
+userspace, for example::
+
+    # mount -t dmemfs none -o pagesize=4K /mnt/
+
+Then the backing storage can be created with a size of 4G::
+
+    # truncate /mnt/dmemfs-uuid --size 4G
+
+To use it as backing storage for a virtual machine started with QEMU,
+specify the memory-backend-file object on the QEMU command line like
+this::
+
+    # -object memory-backend-file,id=ram-node0,mem-path=/mnt/dmemfs-uuid \
+        share=yes,size=4G,host-nodes=0,policy=preferred -numa node,nodeid=0,memdev=ram-node0
+
diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
index 98f59a8..23e944b 100644
--- a/Documentation/filesystems/index.rst
+++ b/Documentation/filesystems/index.rst
@@ -120,3 +120,4 @@ Documentation for filesystem implementations.
    xfs-delayed-logging-design
    xfs-self-describing-metadata
    zonefs
+   dmemfs
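To complement the usage section above, a hedged userspace sketch of what a
consumer such as QEMU's memory-backend-file effectively does with a dmemfs
file: open it, size it, and map it MAP_SHARED. The mount point and file name
mirror the documentation's example; the assumption that a dmemfs file can be
created with open(O_CREAT) and sized with ftruncate() follows the truncate
usage shown above, and error handling is kept minimal.

/*
 * Illustrative userspace sketch: map a dmemfs file as shared memory,
 * the way a VMM backend would.  Assumes dmemfs is mounted at /mnt and
 * a 64-bit build (the 4G size constant).
 */
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	const size_t size = 4UL << 30;	/* 4G, matching the truncate example */
	int fd = open("/mnt/dmemfs-uuid", O_RDWR | O_CREAT, 0600);

	if (fd < 0 || ftruncate(fd, size))
		return 1;

	void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (mem == MAP_FAILED)
		return 1;

	memset(mem, 0, 4096);		/* touch the first page of the mapping */

	munmap(mem, size);
	close(fd);
	return 0;
}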