From patchwork Thu Oct 8 07:53:51 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: yulei zhang
X-Patchwork-Id: 11822301
From: yulei.kernel@gmail.com
X-Google-Original-From: yuleixzhang@tencent.com
To: akpm@linux-foundation.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com
Cc: linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang, Xiao Guangrong
Subject: [PATCH 01/35] fs: introduce dmemfs module
Date: Thu, 8 Oct 2020 15:53:51 +0800
Message-Id: X-Mailer:
git-send-email 2.17.1 In-Reply-To: References: In-Reply-To: References: Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: Yulei Zhang dmemfs (Direct Memory filesystem) is device memory or reserved memory based filesystem. This kind of memory is special as it is not managed by kernel and it is without 'struct page'. The original purpose of dmemfs is to drop the usage of 'struct page' to save extra system memory. This patch introduces the basic framework of dmemfs and only mkdir and create regular file are supported. Signed-off-by: Xiao Guangrong Signed-off-by: Yulei Zhang --- fs/Kconfig | 1 + fs/Makefile | 1 + fs/dmemfs/Kconfig | 13 ++ fs/dmemfs/Makefile | 7 + fs/dmemfs/inode.c | 275 +++++++++++++++++++++++++++++++++++++ include/uapi/linux/magic.h | 1 + 6 files changed, 298 insertions(+) create mode 100644 fs/dmemfs/Kconfig create mode 100644 fs/dmemfs/Makefile create mode 100644 fs/dmemfs/inode.c diff --git a/fs/Kconfig b/fs/Kconfig index aa4c12282301..18e72089426f 100644 --- a/fs/Kconfig +++ b/fs/Kconfig @@ -41,6 +41,7 @@ source "fs/btrfs/Kconfig" source "fs/nilfs2/Kconfig" source "fs/f2fs/Kconfig" source "fs/zonefs/Kconfig" +source "fs/dmemfs/Kconfig" config FS_DAX bool "Direct Access (DAX) support" diff --git a/fs/Makefile b/fs/Makefile index 1c7b0e3f6daa..10e0302c5902 100644 --- a/fs/Makefile +++ b/fs/Makefile @@ -136,3 +136,4 @@ obj-$(CONFIG_EFIVAR_FS) += efivarfs/ obj-$(CONFIG_EROFS_FS) += erofs/ obj-$(CONFIG_VBOXSF_FS) += vboxsf/ obj-$(CONFIG_ZONEFS_FS) += zonefs/ +obj-$(CONFIG_DMEM_FS) += dmemfs/ diff --git a/fs/dmemfs/Kconfig b/fs/dmemfs/Kconfig new file mode 100644 index 000000000000..d2894a513de0 --- /dev/null +++ b/fs/dmemfs/Kconfig @@ -0,0 +1,13 @@ +config DMEM_FS + tristate "Direct Memory filesystem support" + help + dmemfs (Direct Memory filesystem) is device memory or reserved + memory based filesystem. This kind of memory is special as it + is not managed by kernel and it is without 'struct page'. + + The original purpose of dmemfs is saving extra memory of + 'struct page' that reduces the total cost of ownership (TCO) + for cloud providers. + + To compile this file system support as a module, choose M here: the + module will be called dmemfs. diff --git a/fs/dmemfs/Makefile b/fs/dmemfs/Makefile new file mode 100644 index 000000000000..73bdc9cbc87e --- /dev/null +++ b/fs/dmemfs/Makefile @@ -0,0 +1,7 @@ +# SPDX-License-Identifier: GPL-2.0 +# +# Makefile for the linux dmem-filesystem routines. 
+# +obj-$(CONFIG_DMEM_FS) += dmemfs.o + +dmemfs-y += inode.o diff --git a/fs/dmemfs/inode.c b/fs/dmemfs/inode.c new file mode 100644 index 000000000000..6a8a2d9f94e9 --- /dev/null +++ b/fs/dmemfs/inode.c @@ -0,0 +1,275 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * linux/fs/dmemfs/inode.c + * + * Authors: + * Xiao Guangrong + * Chen Zhuo + * Haiwei Li + * Yulei Zhang + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +MODULE_AUTHOR("Tencent Corporation"); +MODULE_LICENSE("GPL v2"); + +struct dmemfs_mount_opts { + unsigned long dpage_size; +}; + +struct dmemfs_fs_info { + struct dmemfs_mount_opts mount_opts; +}; + +enum dmemfs_param { + Opt_dpagesize, +}; + +const struct fs_parameter_spec dmemfs_fs_parameters[] = { + fsparam_string("pagesize", Opt_dpagesize), + {} +}; + +static int check_dpage_size(unsigned long dpage_size) +{ + if (dpage_size != PAGE_SIZE && dpage_size != PMD_SIZE && + dpage_size != PUD_SIZE) + return -EINVAL; + + return 0; +} + +static struct inode * +dmemfs_get_inode(struct super_block *sb, const struct inode *dir, umode_t mode, + dev_t dev); + +static int +dmemfs_mknod(struct inode *dir, struct dentry *dentry, umode_t mode, dev_t dev) +{ + struct inode *inode = dmemfs_get_inode(dir->i_sb, dir, mode, dev); + int error = -ENOSPC; + + if (inode) { + d_instantiate(dentry, inode); + dget(dentry); /* Extra count - pin the dentry in core */ + error = 0; + dir->i_mtime = dir->i_ctime = current_time(inode); + } + return error; +} + +static int dmemfs_create(struct inode *dir, struct dentry *dentry, + umode_t mode, bool excl) +{ + return dmemfs_mknod(dir, dentry, mode | S_IFREG, 0); +} + +static int dmemfs_mkdir(struct inode *dir, struct dentry *dentry, + umode_t mode) +{ + int retval = dmemfs_mknod(dir, dentry, mode | S_IFDIR, 0); + + if (!retval) + inc_nlink(dir); + return retval; +} + +static const struct inode_operations dmemfs_dir_inode_operations = { + .create = dmemfs_create, + .lookup = simple_lookup, + .unlink = simple_unlink, + .mkdir = dmemfs_mkdir, + .rmdir = simple_rmdir, + .rename = simple_rename, +}; + +static const struct inode_operations dmemfs_file_inode_operations = { + .setattr = simple_setattr, + .getattr = simple_getattr, +}; + +int dmemfs_file_mmap(struct file *file, struct vm_area_struct *vma) +{ + return 0; +} + +static const struct file_operations dmemfs_file_operations = { + .mmap = dmemfs_file_mmap, +}; + +static int dmemfs_parse_param(struct fs_context *fc, struct fs_parameter *param) +{ + struct dmemfs_fs_info *fsi = fc->s_fs_info; + struct fs_parse_result result; + int opt, ret; + + opt = fs_parse(fc, dmemfs_fs_parameters, param, &result); + if (opt < 0) + return opt; + + switch (opt) { + case Opt_dpagesize: + fsi->mount_opts.dpage_size = memparse(param->string, NULL); + ret = check_dpage_size(fsi->mount_opts.dpage_size); + if (ret) { + pr_warn("dmemfs: unknown pagesize %x.\n", + result.uint_32); + return ret; + } + break; + default: + pr_warn("dmemfs: unknown mount option [%x].\n", + opt); + return -EINVAL; + } + + return 0; +} + +struct inode *dmemfs_get_inode(struct super_block *sb, + const struct inode *dir, umode_t mode, dev_t dev) +{ + struct inode *inode = new_inode(sb); + + if (inode) { + inode->i_ino = get_next_ino(); + inode_init_owner(inode, dir, mode); + inode->i_mapping->a_ops = &empty_aops; + mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER); + mapping_set_unevictable(inode->i_mapping); + 
inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode); + switch (mode & S_IFMT) { + default: + init_special_inode(inode, mode, dev); + break; + case S_IFREG: + inode->i_op = &dmemfs_file_inode_operations; + inode->i_fop = &dmemfs_file_operations; + break; + case S_IFDIR: + inode->i_op = &dmemfs_dir_inode_operations; + inode->i_fop = &simple_dir_operations; + + /* + * directory inodes start off with i_nlink == 2 + * (for "." entry) + */ + inc_nlink(inode); + break; + case S_IFLNK: + inode->i_op = &page_symlink_inode_operations; + break; + } + } + return inode; +} + +static int dmemfs_statfs(struct dentry *dentry, struct kstatfs *buf) +{ + simple_statfs(dentry, buf); + buf->f_bsize = dentry->d_sb->s_blocksize; + + return 0; +} + +static const struct super_operations dmemfs_ops = { + .statfs = dmemfs_statfs, + .drop_inode = generic_delete_inode, +}; + +static int +dmemfs_fill_super(struct super_block *sb, struct fs_context *fc) +{ + struct inode *inode; + struct dmemfs_fs_info *fsi = sb->s_fs_info; + + sb->s_maxbytes = MAX_LFS_FILESIZE; + sb->s_blocksize = fsi->mount_opts.dpage_size; + sb->s_blocksize_bits = ilog2(fsi->mount_opts.dpage_size); + sb->s_magic = DMEMFS_MAGIC; + sb->s_op = &dmemfs_ops; + sb->s_time_gran = 1; + + inode = dmemfs_get_inode(sb, NULL, S_IFDIR, 0); + sb->s_root = d_make_root(inode); + if (!sb->s_root) + return -ENOMEM; + + return 0; +} + +static int dmemfs_get_tree(struct fs_context *fc) +{ + return get_tree_nodev(fc, dmemfs_fill_super); +} + +static void dmemfs_free_fc(struct fs_context *fc) +{ + kfree(fc->s_fs_info); +} + +static const struct fs_context_operations dmemfs_context_ops = { + .free = dmemfs_free_fc, + .parse_param = dmemfs_parse_param, + .get_tree = dmemfs_get_tree, +}; + +int dmemfs_init_fs_context(struct fs_context *fc) +{ + struct dmemfs_fs_info *fsi; + + fsi = kzalloc(sizeof(*fsi), GFP_KERNEL); + if (!fsi) + return -ENOMEM; + + fsi->mount_opts.dpage_size = PAGE_SIZE; + fc->s_fs_info = fsi; + fc->ops = &dmemfs_context_ops; + return 0; +} + +static void dmemfs_kill_sb(struct super_block *sb) +{ + kill_litter_super(sb); +} + +static struct file_system_type dmemfs_fs_type = { + .owner = THIS_MODULE, + .name = "dmemfs", + .init_fs_context = dmemfs_init_fs_context, + .kill_sb = dmemfs_kill_sb, +}; + +static int __init dmemfs_init(void) +{ + int ret; + + ret = register_filesystem(&dmemfs_fs_type); + + return ret; +} + +static void __exit dmemfs_uninit(void) +{ + unregister_filesystem(&dmemfs_fs_type); +} + +module_init(dmemfs_init) +module_exit(dmemfs_uninit) diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h index f3956fc11de6..3fbd06661c8c 100644 --- a/include/uapi/linux/magic.h +++ b/include/uapi/linux/magic.h @@ -97,5 +97,6 @@ #define DEVMEM_MAGIC 0x454d444d /* "DMEM" */ #define Z3FOLD_MAGIC 0x33 #define PPC_CMM_MAGIC 0xc7571590 +#define DMEMFS_MAGIC 0x2ace90c6 #endif /* __LINUX_MAGIC_H__ */ From patchwork Thu Oct 8 07:53:52 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11822307 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 62C951580 for ; Thu, 8 Oct 2020 07:53:44 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 2791A21924 for ; Thu, 8 Oct 2020 07:53:44 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) 
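As a usage illustration for the filesystem introduced above: a minimal userspace sketch of mounting dmemfs, assuming it was built in or loaded as a module. The mount source and target ("none", "/mnt/dmemfs") are made-up names; the "pagesize" option comes from dmemfs_fs_parameters and the accepted values follow check_dpage_size() (PAGE_SIZE, PMD_SIZE or PUD_SIZE, i.e. 4K, 2M or 1G on x86_64).

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	/* "pagesize" is parsed by dmemfs_parse_param(); 2M maps to PMD_SIZE.
	 * The mount point is an assumption for illustration. */
	if (mount("none", "/mnt/dmemfs", "dmemfs", 0, "pagesize=2M")) {
		perror("mount dmemfs");
		return 1;
	}
	return 0;
}

Files and directories created under such a mount use the block size selected by the pagesize option, which is why dmemfs_fill_super() sets s_blocksize from the parsed dpage_size.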
From: yulei.kernel@gmail.com
X-Google-Original-From: yuleixzhang@tencent.com
To: akpm@linux-foundation.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com
Cc: linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang, Xiao Guangrong
Subject: [PATCH 02/35] mm: support direct memory reservation
Date: Thu, 8 Oct 2020 15:53:52 +0800
Message-Id: <2fbc347a5f52591fc9da8d708fef0be238eb06a5.1602093760.git.yuleixzhang@tencent.com>
X-Mailer: git-send-email 2.17.1
Precedence: bulk
List-ID: X-Mailing-List: kvm@vger.kernel.org

From: Yulei Zhang

Introduce the 'dmem=' boot parameter to reserve system memory for DMEM (direct memory). Compared with 'mem=' and 'memmap=', it reserves memory according to the NUMA topology; for the details, please refer to kernel-parameters.txt.

Signed-off-by: Xiao Guangrong
Signed-off-by: Yulei Zhang
---
 .../admin-guide/kernel-parameters.txt | 38 +++
 arch/x86/kernel/setup.c | 3 +
include/linux/dmem.h | 16 + mm/Kconfig | 9 + mm/Makefile | 1 + mm/dmem.c | 137 ++++++++ mm/dmem_reserve.c | 303 ++++++++++++++++++ 7 files changed, 507 insertions(+) create mode 100644 include/linux/dmem.h create mode 100644 mm/dmem.c create mode 100644 mm/dmem_reserve.c diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index a1068742a6df..da15d4fc49db 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -980,6 +980,44 @@ The filter can be disabled or changed to another driver later using sysfs. + dmem=[!]size[KMG] + [KNL, NUMA] When CONFIG_DMEM is set, this means + the size of memory reserved for dmemfs on each numa + memory node and 'size' must be aligned to the default + alignment that is the size of memory section which is + 128M on default on x86_64. If set '!', such amount of + memory on each node will be owned by kernel and dmemfs + own the rest of memory on each node. + Example: Reserve 4G memory on each node for dmemfs + dmem = 4G + + dmem=[!]size[KMG]:align[KMG] + [KNL, NUMA] Ditto. 'align' should be power of two and + it's not smaller than the default alignment. Also + 'size' must be aligned to 'align'. + Example: Bad dmem parameter because 'size' misaligned + dmem=0x40200000:1G + + dmem=size[KMG]@addr[KMG] + [KNL] When CONFIG_DMEM is set, this marks specific + memory as reserved for dmemfs. Region of memory will be + used by dmemfs, from addr to addr + size. Reserving a + certain memory region for kernel is illegal so '!' is + forbidden. Should not assign 'addr' to 0 because kernel + will occupy fixed memory region begin at 0 address. + Ditto, 'size' and 'addr' must be aligned to default + alignment. + Example: Exclude memory from 5G-6G for dmemfs. + dmem=1G@5G + + dmem=size[KMG]@addr[KMG]:align[KMG] + [KNL] Ditto. 'align' should be power of two and it's + not smaller than the default alignment. Also 'size' + and 'addr' must be aligned to 'align'. Specially, + '@addr' and ':align' could occur in any order. + Example: Exclude memory from 5G-6G for dmemfs. + dmem=1G:1G@5G + driver_async_probe= [KNL] List of driver names to be probed asynchronously. Format: ,... diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c index 3511736fbc74..c2e59093a95e 100644 --- a/arch/x86/kernel/setup.c +++ b/arch/x86/kernel/setup.c @@ -45,6 +45,7 @@ #include #include #include +#include /* * max_low_pfn_mapped: highest directly mapped pfn < 4 GB @@ -1177,6 +1178,8 @@ void __init setup_arch(char **cmdline_p) if (!early_xdbc_setup_hardware()) early_xdbc_register_console(); + dmem_reserve_init(); + x86_init.paging.pagetable_init(); kasan_init(); diff --git a/include/linux/dmem.h b/include/linux/dmem.h new file mode 100644 index 000000000000..5049322d941c --- /dev/null +++ b/include/linux/dmem.h @@ -0,0 +1,16 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +#ifndef _LINUX_DMEM_H +#define _LINUX_DMEM_H + +#ifdef CONFIG_DMEM +int dmem_reserve_init(void); +void dmem_init(void); +int dmem_region_register(int node, phys_addr_t start, phys_addr_t end); + +#else +static inline int dmem_reserve_init(void) +{ + return 0; +} +#endif +#endif /* _LINUX_DMEM_H */ diff --git a/mm/Kconfig b/mm/Kconfig index 6c974888f86f..e1995da11cea 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -226,6 +226,15 @@ config BALLOON_COMPACTION scenario aforementioned and helps improving memory defragmentation. 
# +# support for direct memory basics +config DMEM + bool "Direct Memory Reservation" + def_bool n + depends on SPARSEMEM + help + Allow reservation of memory which could be dedicated usage of dmem. + It's the basics of dmemfs. + # support for memory compaction config COMPACTION bool "Allow for memory compaction" diff --git a/mm/Makefile b/mm/Makefile index d5649f1c12c0..97fa2fdf492e 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -121,3 +121,4 @@ obj-$(CONFIG_MEMFD_CREATE) += memfd.o obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o obj-$(CONFIG_PTDUMP_CORE) += ptdump.o obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o +obj-$(CONFIG_DMEM) += dmem.o dmem_reserve.o diff --git a/mm/dmem.c b/mm/dmem.c new file mode 100644 index 000000000000..b5fb4f1b92db --- /dev/null +++ b/mm/dmem.c @@ -0,0 +1,136 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * memory management for dmemfs + * + * Authors: + * Xiao Guangrong + * Chen Zhuo + * Haiwei Li + * Yulei Zhang + */ +#include +#include +#include +#include +#include +#include +#include +#include +#include + +/* + * There are two kinds of page in dmem management: + * - nature page, it's the CPU's page size, i.e, 4K on x86 + * + * - dmem page, it's the unit size used by dmem itself to manage all + * registered memory. It's set by dmem_alloc_init() + */ +struct dmem_region { + /* original registered memory region */ + phys_addr_t reserved_start_addr; + phys_addr_t reserved_end_addr; + + /* memory region aligned to dmem page */ + phys_addr_t dpage_start_pfn; + phys_addr_t dpage_end_pfn; + + /* + * avoid memory allocation if the dmem region is small enough + */ + unsigned long static_bitmap; + unsigned long *bitmap; + u64 next_free_pos; + struct list_head node; + + unsigned long static_error_bitmap; + unsigned long *error_bitmap; +}; + +/* + * statically define number of regions to avoid allocating memory + * dynamically from memblock as slab is not available at that time + */ +#define DMEM_REGION_PAGES 2 +#define INIT_REGION_NUM \ + ((DMEM_REGION_PAGES << PAGE_SHIFT) / sizeof(struct dmem_region)) + +static struct dmem_region static_regions[INIT_REGION_NUM]; + +struct dmem_node { + unsigned long total_dpages; + unsigned long free_dpages; + + /* fallback list for allocation */ + int nodelist[MAX_NUMNODES]; + struct list_head regions; +}; + +struct dmem_pool { + struct mutex lock; + + unsigned long region_num; + unsigned long registered_pages; + unsigned long unaligned_pages; + + /* shift bits of dmem page */ + unsigned long dpage_shift; + + unsigned long total_dpages; + unsigned long free_dpages; + + /* + * increased when allocator is initialized, + * stop it being destroyed when someone is + * still using it + */ + u64 user_count; + struct dmem_node nodes[MAX_NUMNODES]; +}; + +static struct dmem_pool dmem_pool = { + .lock = __MUTEX_INITIALIZER(dmem_pool.lock), +}; + +#define for_each_dmem_node(_dnode) \ + for (_dnode = dmem_pool.nodes; \ + _dnode < dmem_pool.nodes + ARRAY_SIZE(dmem_pool.nodes); \ + _dnode++) + +void __init dmem_init(void) +{ + struct dmem_node *dnode; + + pr_info("dmem: pre-defined region: %ld\n", INIT_REGION_NUM); + + for_each_dmem_node(dnode) + INIT_LIST_HEAD(&dnode->regions); +} + +/* + * register the memory region to dmem pool as freed memory, the region + * should be properly aligned to PAGE_SIZE at least + * + * it's safe to be out of dmem_pool's lock as it's used at the very + * beginning of system boot + */ +int dmem_region_register(int node, phys_addr_t start, phys_addr_t end) +{ + struct dmem_region *dregion; + + 
pr_info("dmem: register region [%#llx - %#llx] on node %d.\n", + (unsigned long long)start, (unsigned long long)end, node); + + if (unlikely(dmem_pool.region_num >= INIT_REGION_NUM)) { + pr_err("dmem: region is not sufficient.\n"); + return -ENOMEM; + } + + dregion = &static_regions[dmem_pool.region_num++]; + dregion->reserved_start_addr = start; + dregion->reserved_end_addr = end; + + list_add_tail(&dregion->node, &dmem_pool.nodes[node].regions); + dmem_pool.registered_pages += __phys_to_pfn(end) - + __phys_to_pfn(start); + return 0; +} diff --git a/mm/dmem_reserve.c b/mm/dmem_reserve.c new file mode 100644 index 000000000000..567ee9f18a7d --- /dev/null +++ b/mm/dmem_reserve.c @@ -0,0 +1,303 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Support reserved memory for dmem. + * As dmem_reserve_init will adjust memblock to reserve memory + * for dmem, we could save a vast amount of memory for 'struct page'. + * + * Authors: + * Xiao Guangrong + */ +#include +#include +#include +#include +#include + +struct dmem_param { + phys_addr_t base; + phys_addr_t size; + phys_addr_t align; + /* + * If set to 1, dmem_param specified requested memory for kernel, + * otherwise for dmem. + */ + bool resv_kernel; +}; + +static struct dmem_param dmem_param __initdata; + +/* Check dmem param defined by user to match dmem align */ +static int __init check_dmem_param(bool resv_kernel, phys_addr_t base, + phys_addr_t size, phys_addr_t align) +{ + phys_addr_t min_align = 1UL << SECTION_SIZE_BITS; + + if (!align) + align = min_align; + + /* + * the reserved region should be aligned to memory section + * at least + */ + if (align < min_align) { + pr_warn("dmem: 'align' should be %#llx at least to be aligned to memory section.\n", + min_align); + return -EINVAL; + } + + if (!is_power_of_2(align)) { + pr_warn("dmem: 'align' should be power of 2.\n"); + return -EINVAL; + } + + if (base & (align - 1)) { + pr_warn("dmem: 'addr' is unaligned to 'align' in dmem=\n"); + return -EINVAL; + } + + if (size & (align - 1)) { + pr_warn("dmem: 'size' is unaligned to 'align' in dmem=\n"); + return -EINVAL; + } + + if (base >= base + size) { + pr_warn("dmem: 'addr + size' overflow in dmem=\n"); + return -EINVAL; + } + + if (resv_kernel && base) { + pr_warn("dmem: take a certain base address for kernel is illegal\n"); + return -EINVAL; + } + + dmem_param.base = base; + dmem_param.size = size; + dmem_param.align = align; + dmem_param.resv_kernel = resv_kernel; + + pr_info("dmem: parameter: base address %#llx size %#llx align %#llx resv_kernel %d\n", + (unsigned long long)base, (unsigned long long)size, + (unsigned long long)align, resv_kernel); + return 0; +} + +static int __init parse_dmem(char *p) +{ + phys_addr_t base, size, align; + char *oldp; + bool resv_kernel = false; + + if (!p) + return -EINVAL; + + base = align = 0; + + if (*p == '!') { + resv_kernel = true; + p++; + } + + oldp = p; + size = memparse(p, &p); + if (oldp == p) + return -EINVAL; + + if (!size) { + pr_warn("dmem: 'size' of 0 defined in dmem=, or {invalid} param\n"); + return -EINVAL; + } + + while (*p) { + phys_addr_t *pvalue; + + switch (*p) { + case '@': + pvalue = &base; + break; + case ':': + pvalue = &align; + break; + default: + pr_warn("dmem: unknown indicator: %c in dmem=\n", *p); + return -EINVAL; + } + + /* + * Some attribute had been specified multiple times. + * This is not allowed. 
+ */ + if (*pvalue) + return -EINVAL; + + oldp = ++p; + *pvalue = memparse(p, &p); + if (oldp == p) + return -EINVAL; + + if (*pvalue == 0) { + pr_warn("dmem: 'addr' or 'align' should not be set to 0\n"); + return -EINVAL; + } + } + + return check_dmem_param(resv_kernel, base, size, align); +} + +early_param("dmem", parse_dmem); + +/* + * We wanna remove a memory range from memblock.memory thoroughly. + * As isolating memblock.memory in memblock_remove needs to double + * the array of memblock_region, allocated memory for new array maybe + * locate in the memory range which we wanna to remove. + * So, conflict. + * To resolve this conflict, here reserve this memory range firstly. + * While reserving this memory range, isolating memory.reserved will allocate + * memory excluded from memory range which to be removed. So following + * double array in memblock_remove can't observe this reserved range. + */ +static void __init dmem_remove_memblock(phys_addr_t base, phys_addr_t size) +{ + memblock_reserve(base, size); + memblock_remove(base, size); + memblock_free(base, size); +} + +static u64 node_req_mem[MAX_NUMNODES] __initdata; + +/* Reserve certain size of memory for dmem in each numa node */ +static void __init dmem_reserve_size(phys_addr_t size, phys_addr_t align, + bool resv_kernel) +{ + phys_addr_t start, end; + u64 i; + int nid; + + /* Calculate available free memory on each node */ + for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE, &start, + &end, &nid) + node_req_mem[nid] += end - start; + + /* Calculate memory size needed to reserve on each node for dmem */ + for (i = 0; i < MAX_NUMNODES; i++) { + node_req_mem[i] = ALIGN(node_req_mem[i], align); + + if (!resv_kernel) { + node_req_mem[i] = min(size, node_req_mem[i]); + continue; + } + + /* leave dmem_param.size memory for kernel */ + if (node_req_mem[i] > size) + node_req_mem[i] = node_req_mem[i] - size; + else + node_req_mem[i] = 0; + } + +retry: + for_each_free_mem_range_reverse(i, NUMA_NO_NODE, MEMBLOCK_NONE, + &start, &end, &nid) { + /* Well, we have got enough memory for this node. */ + if (!node_req_mem[nid]) + continue; + + start = round_up(start, align); + end = round_down(end, align); + /* Skip memblock_region which is too small */ + if (start >= end) + continue; + + /* Towards memory block at higher address */ + start = end - min((end - start), node_req_mem[nid]); + + /* + * do not have enough resource to save the region, skip it + * from now on + */ + if (dmem_region_register(nid, start, end) < 0) + break; + + dmem_remove_memblock(start, end - start); + + node_req_mem[nid] -= end - start; + + /* We have dropped a memblock, so re-walk it. */ + goto retry; + } + + for (i = 0; i < MAX_NUMNODES; i++) { + if (!node_req_mem[i]) + continue; + + pr_info("dmem: %#llx size of memory is not reserved on node %lld due to misaligned regions.\n", + (unsigned long long)size, i); + } + +} + +/* Reserve [base, base + size) for dmem. */ +static void __init +dmem_reserve_region(phys_addr_t base, phys_addr_t size, phys_addr_t align) +{ + phys_addr_t start, end; + phys_addr_t p_start, p_end; + u64 i; + int nid; + + p_start = base; + p_end = base + size; + +retry: + for_each_free_mem_range_reverse(i, NUMA_NO_NODE, MEMBLOCK_NONE, + &start, &end, &nid) { + /* Find region located in user defined range. 
*/ + if (start >= p_end || end <= p_start) + continue; + + start = round_up(max(start, p_start), align); + end = round_down(min(end, p_end), align); + if (start >= end) + continue; + + if (dmem_region_register(nid, start, end) < 0) + break; + + dmem_remove_memblock(start, end - start); + + size -= end - start; + if (!size) + return; + + /* We have dropped a memblock, so re-walk it. */ + goto retry; + } + + pr_info("dmem: %#llx size of memory is not reserved for dmem due to holes and misaligned regions in [%#llx, %#llx].\n", + (unsigned long long)size, (unsigned long long)base, + (unsigned long long)(base + size)); +} + +/* Reserve memory for dmem */ +int __init dmem_reserve_init(void) +{ + phys_addr_t base, size, align; + bool resv_kernel; + + dmem_init(); + + base = dmem_param.base; + size = dmem_param.size; + align = dmem_param.align; + resv_kernel = dmem_param.resv_kernel; + + /* Dmem param had not been enabled. */ + if (size == 0) + return 0; + + if (base) + dmem_reserve_region(base, size, align); + else + dmem_reserve_size(size, align, resv_kernel); + + return 0; +} From patchwork Thu Oct 8 07:53:53 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11822309 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id ED84013B2 for ; Thu, 8 Oct 2020 07:53:51 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id AF12421897 for ; Thu, 8 Oct 2020 07:53:51 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="FPRMHcbm" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728541AbgJHHxs (ORCPT ); Thu, 8 Oct 2020 03:53:48 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:51834 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728513AbgJHHxp (ORCPT ); Thu, 8 Oct 2020 03:53:45 -0400 Received: from mail-pg1-x543.google.com (mail-pg1-x543.google.com [IPv6:2607:f8b0:4864:20::543]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 5B9C7C061755; Thu, 8 Oct 2020 00:53:45 -0700 (PDT) Received: by mail-pg1-x543.google.com with SMTP id h6so3592534pgk.4; Thu, 08 Oct 2020 00:53:45 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :in-reply-to:references; bh=17qcDn5vRIyg2bPaSSy8ac1hftRo1I/OFiiD2zJ1u8U=; b=FPRMHcbmp5jEXalSfqvGsx1Z6+a/yODcUhA9hX0bhWIBtiwfxHBC5EzCr9mP1+cYBh JkZiW4ArogX+npADPRgQOBCQG0miEVdzzrBXPS5En0b0utrfxo87w1BIIgriI7ZArt8U XFPJKDq02ZpFNXLvctK0DY+rmyUeJFpuwi/on2UsUYwVUo4mcVZhKc7hkT59tedicZrI 0UWq7KiPCAKqfsUcMzp7IylwXI5ProiQ9A7aJkdye3pskwR4lFcQGMXS6jLIoR+7eKPA jhuASMpDPqTtjur+3vy9FnruzN6KCjJGbhcHchpWUOAw3rvaSYZHxlpNi2fXd17Z3oQ1 mbrA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:in-reply-to:references; bh=17qcDn5vRIyg2bPaSSy8ac1hftRo1I/OFiiD2zJ1u8U=; b=mvDgGFqxJTk2yrJ2LM8274RuxiaVdHi3C/nRk5yQeiFw3ISsLjLHlKhlXPg297imhp vAw46Np/G8HJndn5CnMLrQEDcDtXHeu5uve+TlGIjYRspujztQDQkN1DFK+qi0gJ7t+C 1u+y8nJ297Qy6Beq9RXuvxys3PZT14RZ3YiBWNDxJJFCWg72e4EgxK5aerEqb5guVN1w S6cFrY5gc0yj7xle7A4PLNfImUFiTeRqbRy7Pv4zMTyXCbTZJH/ZLD/oQuDhzGp4nHj3 
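To make the alignment rules enforced by check_dmem_param() above concrete, here is a small userspace sketch (not kernel code). The function name dmem_args_ok() and the MIN_ALIGN constant are mine; the checks mirror the kernel ones, and SECTION_SIZE_BITS is assumed to be 27 (the 128M x86_64 default mentioned in the documentation).

#include <stdbool.h>
#include <stdio.h>

#define MIN_ALIGN (1ULL << 27)	/* 128M memory section, x86_64 default */

static bool dmem_args_ok(unsigned long long base, unsigned long long size,
			 unsigned long long align, bool resv_kernel)
{
	if (!align)
		align = MIN_ALIGN;
	if (align < MIN_ALIGN || (align & (align - 1)))
		return false;		/* below section size, or not a power of two */
	if ((base | size) & (align - 1))
		return false;		/* 'addr' or 'size' unaligned to 'align' */
	if (base >= base + size)
		return false;		/* zero size or address overflow */
	if (resv_kernel && base)
		return false;		/* '!' cannot be combined with a fixed base */
	return true;
}

int main(void)
{
	/* dmem=4G            -> accepted */
	printf("dmem=4G            -> %d\n", dmem_args_ok(0, 4ULL << 30, 0, false));
	/* dmem=0x40200000:1G -> rejected, 'size' unaligned to 'align' */
	printf("dmem=0x40200000:1G -> %d\n", dmem_args_ok(0, 0x40200000ULL, 1ULL << 30, false));
	/* dmem=1G@5G         -> accepted */
	printf("dmem=1G@5G         -> %d\n", dmem_args_ok(5ULL << 30, 1ULL << 30, 0, false));
	return 0;
}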
From: yulei.kernel@gmail.com
X-Google-Original-From: yuleixzhang@tencent.com
To: akpm@linux-foundation.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com
Cc: linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang, Xiao Guangrong
Subject: [PATCH 03/35] dmem: implement dmem memory management
Date: Thu, 8 Oct 2020 15:53:53 +0800
Message-Id: <57408f6bd8122d915e46deed96a20a8ac6d90d9f.1602093760.git.yuleixzhang@tencent.com>
X-Mailer: git-send-email 2.17.1
Precedence: bulk
List-ID: X-Mailing-List: kvm@vger.kernel.org

From: Yulei Zhang

This introduces the interfaces to manage dmem pages:

- dmem_region_register() registers reserved memory with the dmem management system; it can later be allocated out for dmemfs.

- dmem_alloc_init() initializes the dmem allocator; note that the page size used by the allocator is not the same thing as the alignment used to reserve dmem memory.

- dmem_alloc_pages_from_vma() and dmem_free_pages() are the interfaces for allocating and freeing dmem memory; multiple pages can be allocated at one time, but the count should be a power of two.

Signed-off-by: Xiao Guangrong
Signed-off-by: Yulei Zhang
---
 include/linux/dmem.h | 3 +
 mm/dmem.c | 674 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 677 insertions(+)

diff --git a/include/linux/dmem.h b/include/linux/dmem.h index 5049322d941c..476a82e8f252 100644 --- a/include/linux/dmem.h +++ b/include/linux/dmem.h @@ -7,6 +7,9 @@ int dmem_reserve_init(void); void dmem_init(void); int dmem_region_register(int node, phys_addr_t start, phys_addr_t end); +int dmem_alloc_init(unsigned long dpage_shift); +void dmem_alloc_uinit(void); + #else static inline int dmem_reserve_init(void) { diff --git a/mm/dmem.c b/mm/dmem.c index b5fb4f1b92db..a77a064c8d59 100644 --- a/mm/dmem.c +++ b/mm/dmem.c @@ -91,11 +91,38 @@ static struct dmem_pool dmem_pool = { .lock = __MUTEX_INITIALIZER(dmem_pool.lock), }; +#define DMEM_PAGE_SIZE (1UL << dmem_pool.dpage_shift) +#define DMEM_PAGE_UP(x) phys_to_dpage(((x) + DMEM_PAGE_SIZE - 1)) +#define DMEM_PAGE_DOWN(x) phys_to_dpage(x) + +#define dpage_to_phys(_dpage) \ ((_dpage) << dmem_pool.dpage_shift) +#define phys_to_dpage(_addr) \ ((_addr) >> dmem_pool.dpage_shift) + +#define dpage_to_pfn(_dpage) \ (__phys_to_pfn(dpage_to_phys(_dpage))) +#define pfn_to_dpage(_pfn) \ (phys_to_dpage(__pfn_to_phys(_pfn))) + +#define dnode_to_nid(_dnode) \ ((_dnode) - dmem_pool.nodes) +#define nid_to_dnode(nid) \ (&dmem_pool.nodes[nid]) + #define for_each_dmem_node(_dnode) \ for (_dnode = dmem_pool.nodes; \ _dnode < dmem_pool.nodes + ARRAY_SIZE(dmem_pool.nodes); \ _dnode++) +#define for_each_dmem_region(_dnode, _dregion) \ list_for_each_entry(_dregion, &(_dnode)->regions, node) + +static inline int
*dmem_nodelist(int nid) +{ + return nid_to_dnode(nid)->nodelist; +} + void __init dmem_init(void) { struct dmem_node *dnode; @@ -135,3 +162,649 @@ int dmem_region_register(int node, phys_addr_t start, phys_addr_t end) return 0; } +#define PENALTY_FOR_DMEM_SHARED_NODE (1) + +static int dmem_nodeload[MAX_NUMNODES] __initdata; + +/* Evaluate penalty for each dmem node */ +static int __init dmem_evaluate_node(int local, int node) +{ + int penalty; + + /* Use the distance array to find the distance */ + penalty = node_distance(local, node); + + /* Penalize nodes under us ("prefer the next node") */ + penalty += (node < local); + + /* Give preference to headless and unused nodes */ + if (!cpumask_empty(cpumask_of_node(node))) + penalty += PENALTY_FOR_NODE_WITH_CPUS; + + /* Penalize dmem-node shared with kernel */ + if (node_state(node, N_MEMORY)) + penalty += PENALTY_FOR_DMEM_SHARED_NODE; + + /* Slight preference for less loaded node */ + penalty *= (nr_online_nodes * MAX_NUMNODES); + + penalty += dmem_nodeload[node]; + + return penalty; +} + +static int __init find_next_dmem_node(int local, nodemask_t *used_nodes) +{ + struct dmem_node *dnode; + int node, best_node = NUMA_NO_NODE; + int penalty, min_penalty = INT_MAX; + + /* Invalid node is not suitable to call node_distance */ + if (!node_state(local, N_POSSIBLE)) + return NUMA_NO_NODE; + + /* Use the local node if we haven't already */ + if (!node_isset(local, *used_nodes)) { + node_set(local, *used_nodes); + return local; + } + + for_each_dmem_node(dnode) { + if (list_empty(&dnode->regions)) + continue; + + node = dnode_to_nid(dnode); + + /* Don't want a node to appear more than once */ + if (node_isset(node, *used_nodes)) + continue; + + penalty = dmem_evaluate_node(local, node); + + if (penalty < min_penalty) { + min_penalty = penalty; + best_node = node; + } + } + + if (best_node >= 0) + node_set(best_node, *used_nodes); + + return best_node; +} + +static int __init dmem_node_init(struct dmem_node *dnode) +{ + int *nodelist; + nodemask_t used_nodes; + int local, node, prev; + int load; + int i = 0; + + nodelist = dnode->nodelist; + nodes_clear(used_nodes); + local = dnode_to_nid(dnode); + prev = local; + load = nr_online_nodes; + + while ((node = find_next_dmem_node(local, &used_nodes)) >= 0) { + /* + * We don't want to pressure a particular node. + * So adding penalty to the first node in same + * distance group to make it round-robin. 
+ */ + if (node_distance(local, node) != node_distance(local, prev)) + dmem_nodeload[node] = load; + + nodelist[i++] = prev = node; + load--; + } + + return 0; +} + +static void __init dmem_region_uinit(struct dmem_region *dregion) +{ + unsigned long nr_pages, size, *bitmap = dregion->error_bitmap; + + if (!bitmap) + return; + + nr_pages = __phys_to_pfn(dregion->reserved_end_addr) + - __phys_to_pfn(dregion->reserved_start_addr); + + WARN_ON(!nr_pages); + + size = BITS_TO_LONGS(nr_pages) * sizeof(long); + if (size > sizeof(dregion->static_bitmap)) + kfree(bitmap); + dregion->error_bitmap = NULL; +} + +/* + * we only stop allocator to use the reserved page and do not + * reture pages back if anything goes wrong + */ +static void __init dmem_uinit(void) +{ + struct dmem_region *dregion, *dr; + struct dmem_node *dnode; + + for_each_dmem_node(dnode) { + dnode->nodelist[0] = NUMA_NO_NODE; + list_for_each_entry_safe(dregion, dr, &dnode->regions, node) { + dmem_region_uinit(dregion); + dregion->reserved_start_addr = + dregion->reserved_end_addr = 0; + list_del(&dregion->node); + } + } + + dmem_pool.region_num = 0; + dmem_pool.registered_pages = 0; +} + +static int __init dmem_region_init(struct dmem_region *dregion) +{ + unsigned long *bitmap, size, nr_pages; + + nr_pages = __phys_to_pfn(dregion->reserved_end_addr) + - __phys_to_pfn(dregion->reserved_start_addr); + + size = BITS_TO_LONGS(nr_pages) * sizeof(long); + if (size <= sizeof(dregion->static_error_bitmap)) { + bitmap = &dregion->static_error_bitmap; + } else { + bitmap = kzalloc(size, GFP_KERNEL); + if (!bitmap) + return -ENOMEM; + } + dregion->error_bitmap = bitmap; + return 0; +} + +/* + * dmem memory is not 'struct page' backend, i.e, the kernel threats + * it as invalid pfn + */ +static int __init dmem_check_region(struct dmem_region *dregion) +{ + unsigned long pfn; + + for (pfn = __phys_to_pfn(dregion->reserved_start_addr); + pfn < __phys_to_pfn(dregion->reserved_end_addr); pfn++) { + if (!WARN_ON(pfn_valid(pfn))) + continue; + + pr_err("dmem: check pfn %#lx failed, its memory was not properly reserved\n", + pfn); + return -EINVAL; + } + + return 0; +} + +static int __init dmem_late_init(void) +{ + struct dmem_region *dregion; + struct dmem_node *dnode; + int ret; + + for_each_dmem_node(dnode) { + dmem_node_init(dnode); + + for_each_dmem_region(dnode, dregion) { + ret = dmem_region_init(dregion); + if (ret) + goto exit; + ret = dmem_check_region(dregion); + if (ret) + goto exit; + } + } + return ret; +exit: + dmem_uinit(); + return ret; +} +late_initcall(dmem_late_init); + +static int dmem_alloc_region_init(struct dmem_region *dregion, + unsigned long *dpages) +{ + unsigned long start, end, *bitmap, size; + + start = DMEM_PAGE_UP(dregion->reserved_start_addr); + end = DMEM_PAGE_DOWN(dregion->reserved_end_addr); + + *dpages = end - start; + if (!*dpages) + return 0; + + size = BITS_TO_LONGS(*dpages) * sizeof(long); + if (size <= sizeof(dregion->static_bitmap)) + bitmap = &dregion->static_bitmap; + else { + bitmap = kzalloc(size, GFP_KERNEL); + if (!bitmap) + return -ENOMEM; + } + + dregion->bitmap = bitmap; + dregion->next_free_pos = 0; + dregion->dpage_start_pfn = start; + dregion->dpage_end_pfn = end; + + dmem_pool.unaligned_pages += __phys_to_pfn((dpage_to_phys(start) + - dregion->reserved_start_addr)); + dmem_pool.unaligned_pages += __phys_to_pfn(dregion->reserved_end_addr + - dpage_to_phys(end)); + return 0; +} + +static bool dmem_dpage_is_error(struct dmem_region *dregion, phys_addr_t dpage) +{ + unsigned long valid_pages; + 
unsigned long pos_pfn, pos_offset; + unsigned long pages_per_dpage = DMEM_PAGE_SIZE >> PAGE_SHIFT; + phys_addr_t reserved_start_pfn; + + reserved_start_pfn = __phys_to_pfn(dregion->reserved_start_addr); + valid_pages = dpage_to_pfn(dregion->dpage_end_pfn) - reserved_start_pfn; + + pos_offset = dpage_to_pfn(dpage) - reserved_start_pfn; + pos_pfn = find_next_bit(dregion->error_bitmap, valid_pages, pos_offset); + if (pos_pfn < pos_offset + pages_per_dpage) + return true; + return false; +} + +static unsigned long +dmem_alloc_bitmap_clear(struct dmem_region *dregion, phys_addr_t dpage, + unsigned int dpages_nr) +{ + u64 pos = dpage - dregion->dpage_start_pfn; + unsigned int i; + unsigned long err_num = 0; + + for (i = 0; i < dpages_nr; i++) { + if (dmem_dpage_is_error(dregion, dpage + i)) { + WARN_ON(!test_bit(pos + i, dregion->bitmap)); + err_num++; + } else { + WARN_ON(!__test_and_clear_bit(pos + i, + dregion->bitmap)); + } + } + return err_num; +} + +/* set or clear corresponding bit on allocation bitmap based on error bitmap */ +static unsigned long dregion_alloc_bitmap_set_clear(struct dmem_region *dregion, + bool set) +{ + unsigned long pos_pfn, pos_offset; + unsigned long valid_pages, mce_dpages = 0; + phys_addr_t dpage, reserved_start_pfn; + + reserved_start_pfn = __phys_to_pfn(dregion->reserved_start_addr); + + valid_pages = dpage_to_pfn(dregion->dpage_end_pfn) - reserved_start_pfn; + pos_offset = dpage_to_pfn(dregion->dpage_start_pfn) + - reserved_start_pfn; +try_set: + pos_pfn = find_next_bit(dregion->error_bitmap, valid_pages, pos_offset); + + if (pos_pfn >= valid_pages) + return mce_dpages; + mce_dpages++; + dpage = pfn_to_dpage(pos_pfn + reserved_start_pfn); + if (set) + WARN_ON(__test_and_set_bit(dpage - dregion->dpage_start_pfn, + dregion->bitmap)); + else + WARN_ON(!__test_and_clear_bit(dpage - dregion->dpage_start_pfn, + dregion->bitmap)); + pos_offset = dpage_to_pfn(dpage + 1) - reserved_start_pfn; + goto try_set; +} + +static void dmem_uinit_check_alloc_bitmap(struct dmem_region *dregion) +{ + unsigned long dpages, size; + + dregion_alloc_bitmap_set_clear(dregion, false); + + dpages = dregion->dpage_end_pfn - dregion->dpage_start_pfn; + size = BITS_TO_LONGS(dpages) * sizeof(long); + WARN_ON(!bitmap_empty(dregion->bitmap, size * BITS_PER_BYTE)); +} + +static void dmem_alloc_region_uinit(struct dmem_region *dregion) +{ + unsigned long dpages, size, *bitmap = dregion->bitmap; + + if (!bitmap) + return; + + dpages = dregion->dpage_end_pfn - dregion->dpage_start_pfn; + WARN_ON(!dpages); + + dmem_uinit_check_alloc_bitmap(dregion); + + size = BITS_TO_LONGS(dpages) * sizeof(long); + if (size > sizeof(dregion->static_bitmap)) + kfree(bitmap); + dregion->bitmap = NULL; +} + +static void __dmem_alloc_uinit(void) +{ + struct dmem_node *dnode; + struct dmem_region *dregion; + + if (!dmem_pool.dpage_shift) + return; + + dmem_pool.unaligned_pages = 0; + + for_each_dmem_node(dnode) { + for_each_dmem_region(dnode, dregion) + dmem_alloc_region_uinit(dregion); + + dnode->total_dpages = dnode->free_dpages = 0; + } + + dmem_pool.dpage_shift = 0; + dmem_pool.total_dpages = dmem_pool.free_dpages = 0; +} + +static void dnode_count_free_dpages(struct dmem_node *dnode, long dpages) +{ + dnode->free_dpages += dpages; + dmem_pool.free_dpages += dpages; +} + +/* + * uninitialize dmem allocator + * + * all dpages should be freed before calling it + */ +void dmem_alloc_uinit(void) +{ + mutex_lock(&dmem_pool.lock); + if (!--dmem_pool.user_count) + __dmem_alloc_uinit(); + mutex_unlock(&dmem_pool.lock); +} 
+EXPORT_SYMBOL(dmem_alloc_uinit); + +/* + * initialize dmem allocator + * @dpage_shift: the shift bits of dmem page size used to manange + * dmem memory, it should be CPU's nature page size at least + * + * Note: the page size the allocator used isn't the same thing with + * the alignment used to reserve dmem memory + */ +int dmem_alloc_init(unsigned long dpage_shift) +{ + struct dmem_node *dnode; + struct dmem_region *dregion; + unsigned long dpages; + int ret = 0; + + if (dpage_shift < PAGE_SHIFT) + return -EINVAL; + + mutex_lock(&dmem_pool.lock); + + if (dmem_pool.dpage_shift) { + /* + * double init on the same page size is okay + * to make the unit tests happy + */ + if (dmem_pool.dpage_shift != dpage_shift) + ret = -EBUSY; + + goto exit; + } + + dmem_pool.dpage_shift = dpage_shift; + + for_each_dmem_node(dnode) { + for_each_dmem_region(dnode, dregion) { + ret = dmem_alloc_region_init(dregion, &dpages); + if (ret < 0) { + __dmem_alloc_uinit(); + goto exit; + } + + dnode_count_free_dpages(dnode, dpages); + } + dnode->total_dpages = dnode->free_dpages; + } + + dmem_pool.total_dpages = dmem_pool.free_dpages; + + if (dmem_pool.unaligned_pages && !ret) + pr_warn("dmem: %llu pages are wasted due to alignment\n", + (unsigned long long)dmem_pool.unaligned_pages); +exit: + if (!ret) + dmem_pool.user_count++; + + mutex_unlock(&dmem_pool.lock); + return ret; +} +EXPORT_SYMBOL(dmem_alloc_init); + +static phys_addr_t +dmem_alloc_region_page(struct dmem_region *dregion, unsigned int try_max, + unsigned int *result_nr) +{ + unsigned long pos, dpages; + unsigned int i; + + /* no dpage is available in this region */ + if (!dregion->bitmap) + return 0; + + dpages = dregion->dpage_end_pfn - dregion->dpage_start_pfn; + + /* no free page in this region */ + if (dregion->next_free_pos >= dpages) + return 0; + + pos = find_next_zero_bit(dregion->bitmap, dpages, + dregion->next_free_pos); + if (pos >= dpages) { + dregion->next_free_pos = pos; + return 0; + } + + __set_bit(pos, dregion->bitmap); + + /* do not go beyond the region */ + try_max = min(try_max, (unsigned int)(dpages - pos - 1)); + for (i = 1; i < try_max; i++) + if (__test_and_set_bit(pos + i, dregion->bitmap)) + break; + + *result_nr = i; + dregion->next_free_pos = pos + *result_nr; + return dpage_to_phys(dregion->dpage_start_pfn + pos); +} + +/* + * allocate dmem pages from the nodelist + * + * @nodelist: dmem_node's nodelist + * @nodemask: nodemask for filtering the dmem nodelist + * @try_max: try to allocate @try_max dpages if possible + * @result_nr: allocated dpage number returned to the caller + * + * return the physical address of the first dpage allocated from dmem + * pool, or 0 on failure. 
The allocated dpage number is filled into + * @result_nr + */ +static phys_addr_t +dmem_alloc_pages_from_nodelist(int *nodelist, nodemask_t *nodemask, + unsigned int try_max, unsigned int *result_nr) +{ + struct dmem_node *dnode; + struct dmem_region *dregion; + phys_addr_t addr = 0; + int node, i; + unsigned int local_result_nr; + + WARN_ON(try_max > 1 && !result_nr); + + if (!result_nr) + result_nr = &local_result_nr; + + *result_nr = 0; + + for (i = 0; !addr && i < ARRAY_SIZE(dnode->nodelist); i++) { + node = nodelist[i]; + + if (nodemask && !node_isset(node, *nodemask)) + continue; + + mutex_lock(&dmem_pool.lock); + + WARN_ON(!dmem_pool.dpage_shift); + + dnode = &dmem_pool.nodes[node]; + for_each_dmem_region(dnode, dregion) { + addr = dmem_alloc_region_page(dregion, try_max, + result_nr); + if (addr) { + dnode_count_free_dpages(dnode, + -(long)(*result_nr)); + break; + } + } + + mutex_unlock(&dmem_pool.lock); + } + return addr; +} + +/* + * allocate a dmem page from the dmem pool and try to allocate more + * continuous dpages if @try_max is not less than 1 + * + * @nid: the NUMA node the dmem page got from + * @nodemask: nodemask for filtering the dmem nodelist + * @try_max: try to allocate @try_max dpages if possible + * @result_nr: allocated dpage number returned to the caller + * + * return the physical address of the first dpage allocated from dmem + * pool, or 0 on failure. The allocated dpage number is filled into + * @result_nr + */ +phys_addr_t +dmem_alloc_pages_nodemask(int nid, nodemask_t *nodemask, unsigned int try_max, + unsigned int *result_nr) +{ + int *nodelist; + + if (nid >= sizeof(ARRAY_SIZE(dmem_pool.nodes))) + return 0; + + nodelist = dmem_nodelist(nid); + return dmem_alloc_pages_from_nodelist(nodelist, nodemask, + try_max, result_nr); +} +EXPORT_SYMBOL(dmem_alloc_pages_nodemask); + +/* + * dmem_alloc_pages_vma - Allocate pages for a VMA. + * + * @vma: Pointer to VMA or NULL if not available. + * @addr: Virtual Address of the allocation. Must be inside the VMA. + * @try_max: try to allocate @try_max dpages if possible + * @result_nr: allocated dpage number returned to the caller + * + * Return the physical address of the first dpage allocated from dmem + * pool, or 0 on failure. The allocated dpage number is filled into + * @result_nr + */ +phys_addr_t +dmem_alloc_pages_vma(struct vm_area_struct *vma, unsigned long addr, + unsigned int try_max, unsigned int *result_nr) +{ + phys_addr_t phys_addr; + int *nl; + unsigned int cpuset_mems_cookie; + +retry_cpuset: + nl = dmem_nodelist(numa_node_id()); + + phys_addr = dmem_alloc_pages_from_nodelist(nl, NULL, try_max, + result_nr); + if (unlikely(!phys_addr && read_mems_allowed_retry(cpuset_mems_cookie))) + goto retry_cpuset; + + return phys_addr; +} +EXPORT_SYMBOL(dmem_alloc_pages_vma); + +/* + * Don't need to call it in a lock. + * This function uses the reserved addresses those are initially registered + * and will not be modified at run time. 
+ */ +static struct dmem_region *find_dmem_region(phys_addr_t phys_addr, + struct dmem_node **pdnode) +{ + struct dmem_node *dnode; + struct dmem_region *dregion; + + for_each_dmem_node(dnode) + for_each_dmem_region(dnode, dregion) { + if (dregion->reserved_start_addr > phys_addr) + continue; + if (dregion->reserved_end_addr <= phys_addr) + continue; + + *pdnode = dnode; + return dregion; + } + + return NULL; +} + +/* + * free dmem page to the dmem pool + * @addr: the physical addree will be freed + * @dpage_nr: the number of dpage to be freed + */ +void dmem_free_pages(phys_addr_t addr, unsigned int dpages_nr) +{ + struct dmem_region *dregion; + struct dmem_node *pdnode = NULL; + phys_addr_t dpage = phys_to_dpage(addr); + u64 pos; + unsigned long err_dpages; + + mutex_lock(&dmem_pool.lock); + + WARN_ON(!dmem_pool.dpage_shift); + + dregion = find_dmem_region(addr, &pdnode); + WARN_ON(!dregion || !dregion->bitmap || !pdnode); + + pos = dpage - dregion->dpage_start_pfn; + dregion->next_free_pos = min(dregion->next_free_pos, pos); + + /* it is not possible to span multiple regions */ + WARN_ON(dpage + dpages_nr - 1 >= dregion->dpage_end_pfn); + + err_dpages = dmem_alloc_bitmap_clear(dregion, dpage, dpages_nr); + + dnode_count_free_dpages(pdnode, dpages_nr - err_dpages); + mutex_unlock(&dmem_pool.lock); +} +EXPORT_SYMBOL(dmem_free_pages); From patchwork Thu Oct 8 07:53:54 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11822313 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 24F5413B2 for ; Thu, 8 Oct 2020 07:54:00 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id F14D421924 for ; Thu, 8 Oct 2020 07:53:59 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="SjRXwrLX" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728567AbgJHHx4 (ORCPT ); Thu, 8 Oct 2020 03:53:56 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:51854 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728363AbgJHHxu (ORCPT ); Thu, 8 Oct 2020 03:53:50 -0400 Received: from mail-pg1-x541.google.com (mail-pg1-x541.google.com [IPv6:2607:f8b0:4864:20::541]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id EA359C061755; Thu, 8 Oct 2020 00:53:49 -0700 (PDT) Received: by mail-pg1-x541.google.com with SMTP id i2so3580518pgh.7; Thu, 08 Oct 2020 00:53:49 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :in-reply-to:references; bh=PBnE1QAo7Jwl736qTOTO4uTCYWWc54L+TMxGG3Lu6zw=; b=SjRXwrLX/YPbydqVxvF92oogIetb81hrvhOGQ+1tLAG6UPk0B4gA9UxrlbncFL2aFy WDDqiRqQoIdbh3Ov6HaZwsFfl+/2U2kL9ypMeEi5XA/zsAoymnW3Ihx9m6rveNm/p4Y8 edaZaQCyqxKCX9MNi+StOfXIRHCagYZzBlQ+DuanG53rM/RCyghsmT1jZeNPyCQ1zQ/L 39dV2kaOtrRdeWrZRaowzSuRQiLUUZydf20C1wl9keC92SCUpYnjJl2/1boT4L5vNYGv tOclGMs5RRw1B1dFU53tprCQ2YFF9df8B7d2nSwQBeem6rqCPovWCO0xTtnthGny1Cs6 X0Aw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:in-reply-to:references; bh=PBnE1QAo7Jwl736qTOTO4uTCYWWc54L+TMxGG3Lu6zw=; 
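As an illustration of the allocator interfaces introduced above, here is a hypothetical kernel-module sketch; it is not part of the series. dmem_alloc_init(), dmem_alloc_uinit(), dmem_alloc_pages_nodemask() and dmem_free_pages() are the symbols exported by this patch, but the sketch assumes the allocation entry points also gain declarations in <linux/dmem.h> (only the first two are added to the header in the hunk shown here).

#include <linux/module.h>
#include <linux/pgtable.h>
#include <linux/topology.h>
#include <linux/dmem.h>

static int __init dmem_demo_init(void)
{
	unsigned int got = 0;
	phys_addr_t addr;
	int ret;

	/* manage the pool in 2M dpages (PMD_SHIFT is 21 on x86_64) */
	ret = dmem_alloc_init(PMD_SHIFT);
	if (ret)
		return ret;

	/* try to take up to 4 contiguous dpages, preferring the local node */
	addr = dmem_alloc_pages_nodemask(numa_node_id(), NULL, 4, &got);
	if (!addr) {
		dmem_alloc_uinit();
		return -ENOMEM;
	}
	pr_info("dmem demo: %u dpage(s) at %pa\n", got, &addr);

	dmem_free_pages(addr, got);
	dmem_alloc_uinit();
	return 0;
}

static void __exit dmem_demo_exit(void)
{
}

module_init(dmem_demo_init);
module_exit(dmem_demo_exit);
MODULE_LICENSE("GPL");

Note the paired init/uinit calls: the pool keeps a user_count, so the allocator stays initialized for as long as at least one user (here the demo module, in the real series dmemfs) holds it.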
From: yulei.kernel@gmail.com
X-Google-Original-From: yuleixzhang@tencent.com
To: akpm@linux-foundation.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com
Cc: linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang, Xiao Guangrong
Subject: [PATCH 04/35] dmem: let pat recognize dmem
Date: Thu, 8 Oct 2020 15:53:54 +0800
Message-Id: <87e23dfbac6f4a68e61d91cddfdfe157163975c1.1602093760.git.yuleixzhang@tencent.com>
X-Mailer: git-send-email 2.17.1
Precedence: bulk
List-ID: X-Mailing-List: kvm@vger.kernel.org

From: Yulei Zhang

x86 PAT uses 'struct page' after only checking whether a range is system RAM; that assumption no longer holds when dmem is used. Teach PAT to recognize this case: the range is RAM but !pfn_valid().

We always use WB for dmem, and any attempt to change this behavior is rejected and triggers a WARN_ON.

Signed-off-by: Xiao Guangrong
Signed-off-by: Yulei Zhang
---
 arch/x86/mm/pat/memtype.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c index 8f665c352bf0..fd8a298fc30b 100644 --- a/arch/x86/mm/pat/memtype.c +++ b/arch/x86/mm/pat/memtype.c @@ -511,6 +511,13 @@ static int reserve_ram_pages_type(u64 start, u64 end, for (pfn = (start >> PAGE_SHIFT); pfn < (end >> PAGE_SHIFT); ++pfn) { enum page_cache_mode type; + /* + * it's dmem if it's ram but not 'struct page' backend, + * we always use WB + */ + if (WARN_ON(!pfn_valid(pfn))) + return -EBUSY; + page = pfn_to_page(pfn); type = get_page_memtype(page); if (type != _PAGE_CACHE_MODE_WB) { @@ -539,6 +546,13 @@ static int free_ram_pages_type(u64 start, u64 end) u64 pfn; for (pfn = (start >> PAGE_SHIFT); pfn < (end >> PAGE_SHIFT); ++pfn) { + /* + * it's dmem, see the comments in + * reserve_ram_pages_type() + */ + if (WARN_ON(!pfn_valid(pfn))) + continue; + page = pfn_to_page(pfn); set_page_memtype(page, _PAGE_CACHE_MODE_WB); } @@ -714,6 +728,13 @@ static enum page_cache_mode lookup_memtype(u64 paddr) if (pat_pagerange_is_ram(paddr, paddr + PAGE_SIZE)) { struct page *page; + /* + * dmem always uses WB, see the comments in + * reserve_ram_pages_type() + */ + if (!pfn_valid(paddr >> PAGE_SHIFT)) + return rettype; + page = pfn_to_page(paddr >> PAGE_SHIFT); return get_page_memtype(page); }

From patchwork Thu Oct 8 07:53:55 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: yulei zhang
X-Patchwork-Id: 11822317
mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id D5A0E109B for ; Thu, 8 Oct 2020 07:54:05 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 9A5AD21924 for ; Thu, 8 Oct 2020 07:54:05 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="HgTTv/Fk" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728633AbgJHHyD (ORCPT ); Thu, 8 Oct 2020 03:54:03 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:51868 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728557AbgJHHxy (ORCPT ); Thu, 8 Oct 2020 03:53:54 -0400 Received: from mail-pf1-x441.google.com (mail-pf1-x441.google.com [IPv6:2607:f8b0:4864:20::441]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A32FFC0613D3; Thu, 8 Oct 2020 00:53:54 -0700 (PDT) Received: by mail-pf1-x441.google.com with SMTP id a200so3305435pfa.10; Thu, 08 Oct 2020 00:53:54 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :in-reply-to:references; bh=aD9L0attICZ331NkzuhZKqK6mFxwDzbVdW4veU3KMMw=; b=HgTTv/FkIiJMdNNQp987wEyOy3Q12qbr3PjH4smEP6yNvBTIMv2dWHSOoQoSkXwPpZ HHV9fT2rTtx+w+KY0xL9pOzhezL0IE7SYYwZu92NAS/4OwvJ12NvRjM/Sx49Vj0TikNB HHHfh19MTjRa3cTzIKfrggNgIGXYYTmGSnSb8y/pfUOpbGRqpZKDP5jCrKkYUsOKm0LR bk/58zK2GmfGhYwPrH2Q1WgsJmJE5KTQE+h92aKvzOdcUCFM1WNTVSvFkD/cxL97XX2h vo8n5dtbN77vXf5O9bHPG7e5OgNj1Q9WZJK5VZsA79FyENyk6zO2GEaBGLLu5daPc8tI WknQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:in-reply-to:references; bh=aD9L0attICZ331NkzuhZKqK6mFxwDzbVdW4veU3KMMw=; b=hwRrhmBEiAal21tkpwlMML37jsz1eJ8JmXuKll6jHPa2WdRu5nmkB804Gioer9QbQN Hm55J/DNRyQynE0R0mQG3flbbe0DtnzBNhZfaRUV4TUk/oZmEV12vdV2KuDQyGbA1CBI 2rCVrWvnigL3EAW8v4fpiJpAdFW6nEWYT/71krFabk9lvXRGHqQvB7mJW12W7e2DpRrn vXtO33iWPHw4XN98Sdy+FsbNf5LNT57Gnoy5EbiyJnATW/+dOF6fV2SE4ndmgpgFc1cC rGAHvfMVWqPbGUg5u0mOXrUxNNeNlxveeWOyIXrppGP5kozbidi4F30qyss8U+XgoEGO N77Q== X-Gm-Message-State: AOAM530/AYCzl1J6QZM/OjP2ViadKsm6IHmrMih9GCO7mmUW9WtR2xMm f2AOptm3FZzuQ8O+3Of+Z/A= X-Google-Smtp-Source: ABdhPJw6kWy+8fvY7bQ35dlUoUr3li5mAMaRpQ+uDDQ3OW0COPRaqCwU6k+iEoWJb/w1mYDl6IbaZQ== X-Received: by 2002:a17:90b:4c0d:: with SMTP id na13mr7136208pjb.102.1602143634205; Thu, 08 Oct 2020 00:53:54 -0700 (PDT) Received: from localhost.localdomain ([203.205.141.61]) by smtp.gmail.com with ESMTPSA id k206sm6777106pfd.126.2020.10.08.00.53.51 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 08 Oct 2020 00:53:53 -0700 (PDT) From: yulei.kernel@gmail.com X-Google-Original-From: yuleixzhang@tencent.com To: akpm@linux-foundation.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com Cc: linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang , Xiao Guangrong Subject: [PATCH 05/35] dmemfs: support mmap Date: Thu, 8 Oct 2020 15:53:55 +0800 Message-Id: <21b236c361e48a8e1118c681570dbe79ac7336db.1602093760.git.yuleixzhang@tencent.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: References: In-Reply-To: References: Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: Yulei 
Zhang It adds mmap support. Note the file will be extended if it's beyond mmap's offset, that drops the requirement of write() operation, however, it has not supported cutting file down. Signed-off-by: Xiao Guangrong Signed-off-by: Yulei Zhang --- fs/dmemfs/inode.c | 337 ++++++++++++++++++++++++++++++++++++++++++- include/linux/dmem.h | 10 ++ 2 files changed, 345 insertions(+), 2 deletions(-) diff --git a/fs/dmemfs/inode.c b/fs/dmemfs/inode.c index 6a8a2d9f94e9..21d2f951b4ea 100644 --- a/fs/dmemfs/inode.c +++ b/fs/dmemfs/inode.c @@ -26,6 +26,7 @@ #include #include #include +#include MODULE_AUTHOR("Tencent Corporation"); MODULE_LICENSE("GPL v2"); @@ -105,8 +106,250 @@ static const struct inode_operations dmemfs_file_inode_operations = { .getattr = simple_getattr, }; +static unsigned long dmem_pgoff_to_index(struct inode *inode, pgoff_t pgoff) +{ + struct super_block *sb = inode->i_sb; + + return pgoff >> (sb->s_blocksize_bits - PAGE_SHIFT); +} + +static void *dmem_addr_to_entry(struct inode *inode, phys_addr_t addr) +{ + struct super_block *sb = inode->i_sb; + + addr >>= sb->s_blocksize_bits; + return xa_mk_value(addr); +} + +static phys_addr_t dmem_entry_to_addr(struct inode *inode, void *entry) +{ + struct super_block *sb = inode->i_sb; + + WARN_ON(!xa_is_value(entry)); + return xa_to_value(entry) << sb->s_blocksize_bits; +} + +static unsigned long +dmem_addr_to_pfn(struct inode *inode, phys_addr_t addr, pgoff_t pgoff, + unsigned int fault_shift) +{ + struct super_block *sb = inode->i_sb; + unsigned long pfn = addr >> PAGE_SHIFT; + unsigned long mask; + + mask = (1UL << ((unsigned int)sb->s_blocksize_bits - fault_shift)) - 1; + mask <<= fault_shift - PAGE_SHIFT; + + return pfn + (pgoff & mask); +} + +static inline unsigned long dmem_page_size(struct inode *inode) +{ + return inode->i_sb->s_blocksize; +} + +static int check_inode_size(struct inode *inode, loff_t offset) +{ + WARN_ON_ONCE(!rcu_read_lock_held()); + + if (offset >= i_size_read(inode)) + return -EINVAL; + + return 0; +} + +static unsigned +dmemfs_find_get_entries(struct address_space *mapping, unsigned long start, + unsigned int nr_entries, void **entries, + unsigned long *indices) +{ + XA_STATE(xas, &mapping->i_pages, start); + + void *entry; + unsigned int ret = 0; + + if (!nr_entries) + return 0; + + rcu_read_lock(); + + xas_for_each(&xas, entry, ULONG_MAX) { + if (xas_retry(&xas, entry)) + continue; + + if (xa_is_value(entry)) + goto export; + + if (unlikely(entry != xas_reload(&xas))) + goto retry; + +export: + indices[ret] = xas.xa_index; + entries[ret] = entry; + if (++ret == nr_entries) + break; + continue; +retry: + xas_reset(&xas); + } + rcu_read_unlock(); + return ret; +} + +static void *find_radix_entry_or_next(struct address_space *mapping, + unsigned long start, + unsigned long *eindex) +{ + void *entry = NULL; + + dmemfs_find_get_entries(mapping, start, 1, &entry, eindex); + return entry; +} + +/* + * find the entry in radix tree based on @index, create it if + * it does not exist + * + * return the entry with rcu locked, otherwise ERR_PTR() + * is returned + */ +static void * +radix_get_create_entry(struct vm_area_struct *vma, unsigned long fault_addr, + struct inode *inode, pgoff_t pgoff) +{ + struct address_space *mapping = inode->i_mapping; + unsigned long eindex, index; + loff_t offset; + phys_addr_t addr; + gfp_t gfp_masks = mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM; + void *entry; + unsigned int try_dpages, dpages; + int ret; + +retry: + offset = ((loff_t)pgoff << PAGE_SHIFT); + index = 
dmem_pgoff_to_index(inode, pgoff); + rcu_read_lock(); + ret = check_inode_size(inode, offset); + if (ret) { + rcu_read_unlock(); + return ERR_PTR(ret); + } + + try_dpages = dmem_pgoff_to_index(inode, (i_size_read(inode) - offset) + >> PAGE_SHIFT); + entry = find_radix_entry_or_next(mapping, index, &eindex); + if (entry) { + WARN_ON(!xa_is_value(entry)); + if (eindex == index) + return entry; + + WARN_ON(eindex <= index); + try_dpages = eindex - index; + } + rcu_read_unlock(); + + /* entry does not exist, create it */ + addr = dmem_alloc_pages_vma(vma, fault_addr, try_dpages, &dpages); + if (!addr) { + /* + * do not return -ENOMEM as that will trigger OOM, + * it is useless for reclaiming dmem page + */ + ret = -EINVAL; + goto exit; + } + + try_dpages = dpages; + while (dpages) { + rcu_read_lock(); + ret = check_inode_size(inode, offset); + if (ret) + goto unlock_rcu; + + entry = dmem_addr_to_entry(inode, addr); + entry = xa_store(&mapping->i_pages, index, entry, gfp_masks); + if (!xa_is_err(entry)) { + addr += inode->i_sb->s_blocksize; + offset += inode->i_sb->s_blocksize; + dpages--; + mapping->nrexceptional++; + index++; + } + +unlock_rcu: + rcu_read_unlock(); + if (ret) + break; + } + + if (dpages) + dmem_free_pages(addr, dpages); + + /* we have created some entries, let's retry it */ + if (ret == -EEXIST || try_dpages != dpages) + goto retry; +exit: + return ERR_PTR(ret); +} + +static void radix_put_entry(void) +{ + rcu_read_unlock(); +} + +static vm_fault_t dmemfs_fault(struct vm_fault *vmf) +{ + struct vm_area_struct *vma = vmf->vma; + struct inode *inode = file_inode(vma->vm_file); + phys_addr_t addr; + void *entry; + int ret; + + if (vmf->pgoff > (MAX_LFS_FILESIZE >> PAGE_SHIFT)) + return VM_FAULT_SIGBUS; + + entry = radix_get_create_entry(vma, (unsigned long)vmf->address, + inode, vmf->pgoff); + if (IS_ERR(entry)) { + ret = PTR_ERR(entry); + goto exit; + } + + addr = dmem_entry_to_addr(inode, entry); + ret = vmf_insert_pfn(vma, (unsigned long)vmf->address, + dmem_addr_to_pfn(inode, addr, vmf->pgoff, + PAGE_SHIFT)); + radix_put_entry(); + +exit: + return ret; +} + +static unsigned long dmemfs_pagesize(struct vm_area_struct *vma) +{ + return dmem_page_size(file_inode(vma->vm_file)); +} + +static const struct vm_operations_struct dmemfs_vm_ops = { + .fault = dmemfs_fault, + .pagesize = dmemfs_pagesize, +}; + int dmemfs_file_mmap(struct file *file, struct vm_area_struct *vma) { + struct inode *inode = file_inode(file); + + if (vma->vm_pgoff & ((dmem_page_size(inode) - 1) >> PAGE_SHIFT)) + return -EINVAL; + + if (!(vma->vm_flags & VM_SHARED)) + return -EINVAL; + + vma->vm_flags |= VM_PFNMAP; + + file_accessed(file); + vma->vm_ops = &dmemfs_vm_ops; return 0; } @@ -189,9 +432,86 @@ static int dmemfs_statfs(struct dentry *dentry, struct kstatfs *buf) return 0; } +/* + * should make sure the dmem page in the dropped region is not + * being mapped by any process + */ +static void inode_drop_dpages(struct inode *inode, loff_t start, loff_t end) +{ + struct address_space *mapping = inode->i_mapping; + struct pagevec pvec; + unsigned long istart, iend, indices[PAGEVEC_SIZE]; + int i; + + /* we never use normap page */ + WARN_ON(mapping->nrpages); + + /* if no dpage is allocated for the inode */ + if (!mapping->nrexceptional) + return; + + istart = dmem_pgoff_to_index(inode, start >> PAGE_SHIFT); + iend = dmem_pgoff_to_index(inode, end >> PAGE_SHIFT); + pagevec_init(&pvec); + while (istart < iend) { + pvec.nr = dmemfs_find_get_entries(mapping, istart, + min(iend - istart, + (unsigned 
long)PAGEVEC_SIZE), + (void **)pvec.pages, + indices); + if (!pvec.nr) + break; + + for (i = 0; i < pagevec_count(&pvec); i++) { + phys_addr_t addr; + + istart = indices[i]; + if (istart >= iend) + break; + + xa_erase(&mapping->i_pages, istart); + mapping->nrexceptional--; + + addr = dmem_entry_to_addr(inode, pvec.pages[i]); + dmem_free_page(addr); + } + + /* + * only exception entries in pagevec, it's safe to + * reinit it + */ + pagevec_reinit(&pvec); + cond_resched(); + istart++; + } +} + +static void dmemfs_evict_inode(struct inode *inode) +{ + /* no VMA works on it */ + WARN_ON(!RB_EMPTY_ROOT(&inode->i_data.i_mmap.rb_root)); + + inode_drop_dpages(inode, 0, LLONG_MAX); + clear_inode(inode); +} + +/* + * Display the mount options in /proc/mounts. + */ +static int dmemfs_show_options(struct seq_file *m, struct dentry *root) +{ + struct dmemfs_fs_info *fsi = root->d_sb->s_fs_info; + + if (check_dpage_size(fsi->mount_opts.dpage_size)) + seq_printf(m, ",pagesize=%lx", fsi->mount_opts.dpage_size); + return 0; +} + static const struct super_operations dmemfs_ops = { .statfs = dmemfs_statfs, + .evict_inode = dmemfs_evict_inode, .drop_inode = generic_delete_inode, + .show_options = dmemfs_show_options, }; static int @@ -199,6 +519,7 @@ dmemfs_fill_super(struct super_block *sb, struct fs_context *fc) { struct inode *inode; struct dmemfs_fs_info *fsi = sb->s_fs_info; + int ret; sb->s_maxbytes = MAX_LFS_FILESIZE; sb->s_blocksize = fsi->mount_opts.dpage_size; @@ -207,11 +528,17 @@ dmemfs_fill_super(struct super_block *sb, struct fs_context *fc) sb->s_op = &dmemfs_ops; sb->s_time_gran = 1; + ret = dmem_alloc_init(sb->s_blocksize_bits); + if (ret) + return ret; + inode = dmemfs_get_inode(sb, NULL, S_IFDIR, 0); sb->s_root = d_make_root(inode); - if (!sb->s_root) - return -ENOMEM; + if (!sb->s_root) { + dmem_alloc_uinit(); + return -ENOMEM; + } return 0; } @@ -247,7 +574,13 @@ int dmemfs_init_fs_context(struct fs_context *fc) static void dmemfs_kill_sb(struct super_block *sb) { + bool has_inode = !!sb->s_root; + kill_litter_super(sb); + + /* do not uninit dmem allocator if mount failed */ + if (has_inode) + dmem_alloc_uinit(); } static struct file_system_type dmemfs_fs_type = { diff --git a/include/linux/dmem.h b/include/linux/dmem.h index 476a82e8f252..8682d63ed43a 100644 --- a/include/linux/dmem.h +++ b/include/linux/dmem.h @@ -10,6 +10,16 @@ int dmem_region_register(int node, phys_addr_t start, phys_addr_t end); int dmem_alloc_init(unsigned long dpage_shift); void dmem_alloc_uinit(void); +phys_addr_t +dmem_alloc_pages_nodemask(int nid, nodemask_t *nodemask, unsigned int try_max, + unsigned int *result_nr); + +phys_addr_t +dmem_alloc_pages_vma(struct vm_area_struct *vma, unsigned long addr, + unsigned int try_max, unsigned int *result_nr); + +void dmem_free_pages(phys_addr_t addr, unsigned int dpages_nr); +#define dmem_free_page(addr) dmem_free_pages(addr, 1) #else static inline int dmem_reserve_init(void) { From patchwork Thu Oct 8 07:53:56 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11822321 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 493D41580 for ; Thu, 8 Oct 2020 07:54:10 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 1F8E321941 for ; Thu, 8 Oct 2020 07:54:10 +0000 (UTC) Authentication-Results: 
mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="AY/dYRPH" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728549AbgJHHyC (ORCPT ); Thu, 8 Oct 2020 03:54:02 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:51886 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728590AbgJHHyA (ORCPT ); Thu, 8 Oct 2020 03:54:00 -0400 Received: from mail-pg1-x543.google.com (mail-pg1-x543.google.com [IPv6:2607:f8b0:4864:20::543]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id AA830C061755; Thu, 8 Oct 2020 00:54:00 -0700 (PDT) Received: by mail-pg1-x543.google.com with SMTP id y14so3558695pgf.12; Thu, 08 Oct 2020 00:54:00 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :in-reply-to:references; bh=W3BSJ4nnntDUiPO8aKzddZBmFpEEB9T5x8/2e53DTXE=; b=AY/dYRPHOTucdyPPWq259WQSZOmMVkNpWWwJGHnhaPHpwF/Py+9Nbe4l5C9/KTgpGl hP4TnyIn5knVZk9qR8DzaPkQ2aeeUgDQT9FNMiayT5MqAK57EMEmp3xHe/cKlpou1pJM tWu0gGWXEtEt5c/Jc3BOQT2XpwhDJ9k4vniUof7lyA6rwCS7AuDtvdUFrw3P/Ls3eIZ1 YT/Oa53C1nLBdI6odiNBwCdHwnkpBkuwp4AAn7n2kPAMMbT98jHrMnxW+LupsZLUd+pj Dp88FI+XUGVZN5v4nZmnN3zkzosMvIY7DBNK1GsPLChSu4om+N4EtbvabqYb/L8/qj4J 3ICg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:in-reply-to:references; bh=W3BSJ4nnntDUiPO8aKzddZBmFpEEB9T5x8/2e53DTXE=; b=EUNcNpj2N5OlCVdJldS0MK6OuC6+fw+qi+ZmxWuTapunTdHqScurQT4/1mZ2+e0v0y N7bHmAZkcb56b9pg6XFntsVN+PUUcDUfcnXG/BJeaIwZBiQkbkyyFn8Maq6iqKfQoTqV zKzR2H9K0KV168eBU4dS7RuBQy7F2zKs+ofRA/8zLuZIyA4yb2XtQw1Qe/7dQklkNMTv Kvsqb5WD9h5RnIIBldH8VKoEe2Ewmb9Hqr986YIA3BP9T2NqcsP7dvqddZbf9d5Xk3jS rSuZLLbEBROazOg9u+3P4ohg0TUT1KzuBwhaSQDUFVtZSPSHvD4y/SyqnQWMnPclN9EL k3pA== X-Gm-Message-State: AOAM530zcWTA0uvT+NMwbYDA/ZEuSk34BvsdHR+22OAUb4JOMhhJAX5T zp3BIDQzmDtscye2FGgskhQ= X-Google-Smtp-Source: ABdhPJw/j+zr/SZvlpT3YnFskHw6EzTSV6uAMFuWg1/IrNzRRywlAHSD+6l6Nt/6uyPCQy2/Fjf4vA== X-Received: by 2002:a17:90b:4b90:: with SMTP id lr16mr7103542pjb.0.1602143640260; Thu, 08 Oct 2020 00:54:00 -0700 (PDT) Received: from localhost.localdomain ([203.205.141.61]) by smtp.gmail.com with ESMTPSA id k206sm6777106pfd.126.2020.10.08.00.53.57 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 08 Oct 2020 00:53:59 -0700 (PDT) From: yulei.kernel@gmail.com X-Google-Original-From: yuleixzhang@tencent.com To: akpm@linux-foundation.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com Cc: linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang , Xiao Guangrong Subject: [PATCH 06/35] dmemfs: support truncating inode down Date: Thu, 8 Oct 2020 15:53:56 +0800 Message-Id: <0e0c4b6a86d5af7cf3fa71b18b68f0c7da819f34.1602093760.git.yuleixzhang@tencent.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: References: In-Reply-To: References: Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: Yulei Zhang To support cut inode down, it will introduce the race between page fault handler and truncating handler as the entry to be deleted is being mapped into process's VMA in order to make page fault faster (as it's the hot path), we use RCU to sync these two handlers. 
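A minimal sketch of that ordering (illustrative only, simplified from the handlers in this patch; the real code also unmaps the range and frees the dropped dpages):

#include <linux/fs.h>
#include <linux/rcupdate.h>

/* truncate side: publish the new size, then wait for in-flight faults */
static void sketch_truncate(struct inode *inode, loff_t newsize)
{
	i_size_write(inode, newsize);	/* 1. new size becomes visible */
	synchronize_rcu();		/* 2. faults that saw the old size finish */
	/* 3. now it is safe to unmap and drop radix entries beyond newsize */
}

/* fault side: validate the offset under rcu_read_lock() before using entries */
static int sketch_fault_check(struct inode *inode, loff_t offset)
{
	int ret = 0;

	rcu_read_lock();
	if (offset >= i_size_read(inode))
		ret = -EINVAL;	/* truncated region, do not create an entry */
	rcu_read_unlock();
	return ret;
}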
When inode's size is updated, the handler makes sure the new size is visible to page fault handler who will not use truncated entry anymore and will not create new entry in that region Signed-off-by: Xiao Guangrong Signed-off-by: Yulei Zhang --- fs/dmemfs/inode.c | 67 ++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 66 insertions(+), 1 deletion(-) diff --git a/fs/dmemfs/inode.c b/fs/dmemfs/inode.c index 21d2f951b4ea..d617494fc633 100644 --- a/fs/dmemfs/inode.c +++ b/fs/dmemfs/inode.c @@ -101,8 +101,73 @@ static const struct inode_operations dmemfs_dir_inode_operations = { .rename = simple_rename, }; +static void inode_drop_dpages(struct inode *inode, loff_t start, loff_t end); + +static int dmemfs_truncate(struct inode *inode, loff_t newsize) +{ + struct super_block *sb = inode->i_sb; + loff_t current_size; + + if (newsize & ((1 << sb->s_blocksize_bits) - 1)) + return -EINVAL; + + current_size = i_size_read(inode); + i_size_write(inode, newsize); + + if (newsize >= current_size) + return 0; + + /* it cuts the inode down */ + + /* + * we should make sure inode->i_size has been updated before + * unmapping and dropping radix entries, so that other sides + * can not create new i_mapping entry beyond inode->i_size + * and the radix entry in the truncated region is not being + * used + * + * see the comments in dmemfs_fault() + */ + synchronize_rcu(); + + /* + * should unmap all mapping first as dmem pages are freed in + * inode_drop_dpages() + * + * after that, dmem page in the truncated region is not used + * by any process + */ + unmap_mapping_range(inode->i_mapping, newsize, 0, 1); + + inode_drop_dpages(inode, newsize, LLONG_MAX); + return 0; +} + +/* + * same logic as simple_setattr but we need to handle ftruncate + * carefully as we inserted self-defined entry into radix tree + */ +static int dmemfs_setattr(struct dentry *dentry, struct iattr *iattr) +{ + struct inode *inode = dentry->d_inode; + int error; + + error = setattr_prepare(dentry, iattr); + if (error) + return error; + + if (iattr->ia_valid & ATTR_SIZE) { + error = dmemfs_truncate(inode, iattr->ia_size); + if (error) + return error; + } + setattr_copy(inode, iattr); + mark_inode_dirty(inode); + return 0; +} + static const struct inode_operations dmemfs_file_inode_operations = { - .setattr = simple_setattr, + .setattr = dmemfs_setattr, .getattr = simple_getattr, }; From patchwork Thu Oct 8 07:53:57 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11822325 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id EC54313B2 for ; Thu, 8 Oct 2020 07:54:18 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id B9C552193E for ; Thu, 8 Oct 2020 07:54:18 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="cL1aVP4n" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728701AbgJHHyP (ORCPT ); Thu, 8 Oct 2020 03:54:15 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:51938 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728548AbgJHHyP (ORCPT ); Thu, 8 Oct 2020 03:54:15 -0400 Received: from mail-pf1-x443.google.com (mail-pf1-x443.google.com [IPv6:2607:f8b0:4864:20::443]) by lindbergh.monkeyblade.net 
(Postfix) with ESMTPS id EEB27C061755; Thu, 8 Oct 2020 00:54:14 -0700 (PDT) Received: by mail-pf1-x443.google.com with SMTP id n14so3319456pff.6; Thu, 08 Oct 2020 00:54:14 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :in-reply-to:references; bh=WRQTJMxrcLsyspN0RDKoQcsucleEqYhnNKELKvYMNac=; b=cL1aVP4nJwGbuLtUrnOpoTsBcxY0Mv6rdVuaVIzpSB40thQVdU92DxjcTJaftqiRpU 7KW/12W6HcO6rj9IiyjJJlVF0sH0RNtHQASVVOvAviKHL/oU3JmW6Suv5JGsoI79tQAE 7vUoVVV23lXL74U4s3vem1krYqWnOxU27ijwf9ziMTMn2GaMixmRYJk9sV1cqeK3W8ww c4DBCjU0lEB9mh9uiByeLPPPtkoCQQtKYoTERVg0iucEOTlysxxe1HrWYjcvMmXcJn6E WTdJDhLWSmNi0LbCsUl4c2X5fBUqsiR+R98LB40+DjihERJ8PKi3sHA+Da/hEIbGMw3w zNJA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:in-reply-to:references; bh=WRQTJMxrcLsyspN0RDKoQcsucleEqYhnNKELKvYMNac=; b=e8W9p+0LLjXER76Ee3iszlkMrxBkMxy8v8M1RJwxd09qjTvPICKHKDp5C4JYI0ni80 6cQa5Av3d2pQI/qAywT31L56oLX4A+53hlKR8M7AB+dw1fXbC9gDUnPZAFQGB6yyG3hL 36TonCm21MQzjGG5XtL3AFMmGGnJvgGj6fnsJOhSnlCgvUc8DdqTNFDRBEc/86quBTAx BGsrpFsmrYLQ0wydnBaKMeQA5zTmJrb4oDYgThfuqnVGbOsU3+h83qyYERPLqrJtgKd+ GSE+VkLBOyYSIuVQqzD1qcaJgHUgmnZns4xGOgRKOmerZ5VW89B9JguwFQKI6jzuIhTj LKdQ== X-Gm-Message-State: AOAM530UPnirEQONRoHuVf/VmIeVXnxzkfrK854oxbFq22REUPLA6iQJ vrQGFz6inMmWprFQXvqlm6c= X-Google-Smtp-Source: ABdhPJzzjN2WTI3Ut/2f1sheJCfE2JopXoUTUzI3IN5DS6ByVHKSAerQSFHW5+XvWm7q+olbv2SJPA== X-Received: by 2002:a63:77c4:: with SMTP id s187mr6307591pgc.303.1602143654497; Thu, 08 Oct 2020 00:54:14 -0700 (PDT) Received: from localhost.localdomain ([203.205.141.61]) by smtp.gmail.com with ESMTPSA id k206sm6777106pfd.126.2020.10.08.00.54.11 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 08 Oct 2020 00:54:13 -0700 (PDT) From: yulei.kernel@gmail.com X-Google-Original-From: yuleixzhang@tencent.com To: akpm@linux-foundation.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com Cc: linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang , Xiao Guangrong Subject: [PATCH 07/35] dmem: trace core functions Date: Thu, 8 Oct 2020 15:53:57 +0800 Message-Id: X-Mailer: git-send-email 2.17.1 In-Reply-To: References: In-Reply-To: References: Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: Yulei Zhang Add tracepoints for alloc_init, alloc and free functions, that helps us to figure out what is happening inside dmem allocator Signed-off-by: Xiao Guangrong Signed-off-by: Yulei Zhang --- fs/dmemfs/Makefile | 1 + fs/dmemfs/inode.c | 5 +++ fs/dmemfs/trace.h | 54 +++++++++++++++++++++++++++++ include/trace/events/dmem.h | 68 +++++++++++++++++++++++++++++++++++++ mm/dmem.c | 6 ++++ 5 files changed, 134 insertions(+) create mode 100644 fs/dmemfs/trace.h create mode 100644 include/trace/events/dmem.h diff --git a/fs/dmemfs/Makefile b/fs/dmemfs/Makefile index 73bdc9cbc87e..0b36d03f1097 100644 --- a/fs/dmemfs/Makefile +++ b/fs/dmemfs/Makefile @@ -2,6 +2,7 @@ # # Makefile for the linux dmem-filesystem routines. 
# +ccflags-y += -I $(srctree)/$(src) # needed for trace events obj-$(CONFIG_DMEM_FS) += dmemfs.o dmemfs-y += inode.o diff --git a/fs/dmemfs/inode.c b/fs/dmemfs/inode.c index d617494fc633..8b0516d98ee7 100644 --- a/fs/dmemfs/inode.c +++ b/fs/dmemfs/inode.c @@ -31,6 +31,9 @@ MODULE_AUTHOR("Tencent Corporation"); MODULE_LICENSE("GPL v2"); +#define CREATE_TRACE_POINTS +#include "trace.h" + struct dmemfs_mount_opts { unsigned long dpage_size; }; @@ -339,6 +342,7 @@ radix_get_create_entry(struct vm_area_struct *vma, unsigned long fault_addr, offset += inode->i_sb->s_blocksize; dpages--; mapping->nrexceptional++; + trace_dmemfs_radix_tree_insert(index, entry); index++; } @@ -535,6 +539,7 @@ static void inode_drop_dpages(struct inode *inode, loff_t start, loff_t end) break; xa_erase(&mapping->i_pages, istart); + trace_dmemfs_radix_tree_delete(istart, pvec.pages[i]); mapping->nrexceptional--; addr = dmem_entry_to_addr(inode, pvec.pages[i]); diff --git a/fs/dmemfs/trace.h b/fs/dmemfs/trace.h new file mode 100644 index 000000000000..cc1165332e60 --- /dev/null +++ b/fs/dmemfs/trace.h @@ -0,0 +1,54 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/** + * trace.h - DesignWare Support + * + * Copyright (C) + * + * Author: Xiao Guangrong + */ + +#undef TRACE_SYSTEM +#define TRACE_SYSTEM dmemfs + +#if !defined(_TRACE_DMEMFS_H) || defined(TRACE_HEADER_MULTI_READ) +#define _TRACE_DMEMFS_H + +#include + +DECLARE_EVENT_CLASS(dmemfs_radix_tree_class, + TP_PROTO(unsigned long index, void *rentry), + TP_ARGS(index, rentry), + + TP_STRUCT__entry( + __field(unsigned long, index) + __field(void *, rentry) + ), + + TP_fast_assign( + __entry->index = index; + __entry->rentry = rentry; + ), + + TP_printk("index %lu entry %#lx", __entry->index, + (unsigned long)__entry->rentry) +); + +DEFINE_EVENT(dmemfs_radix_tree_class, dmemfs_radix_tree_insert, + TP_PROTO(unsigned long index, void *rentry), + TP_ARGS(index, rentry) +); + +DEFINE_EVENT(dmemfs_radix_tree_class, dmemfs_radix_tree_delete, + TP_PROTO(unsigned long index, void *rentry), + TP_ARGS(index, rentry) +); +#endif + +#undef TRACE_INCLUDE_PATH +#define TRACE_INCLUDE_PATH . 
+ +#undef TRACE_INCLUDE_FILE +#define TRACE_INCLUDE_FILE trace + +/* This part must be outside protection */ +#include diff --git a/include/trace/events/dmem.h b/include/trace/events/dmem.h new file mode 100644 index 000000000000..10d1b90a7783 --- /dev/null +++ b/include/trace/events/dmem.h @@ -0,0 +1,68 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +#undef TRACE_SYSTEM +#define TRACE_SYSTEM dmem + +#if !defined(_TRACE_DMEM_H) || defined(TRACE_HEADER_MULTI_READ) +#define _TRACE_DMEM_H + +#include + +TRACE_EVENT(dmem_alloc_init, + TP_PROTO(unsigned long dpage_shift), + TP_ARGS(dpage_shift), + + TP_STRUCT__entry( + __field(unsigned long, dpage_shift) + ), + + TP_fast_assign( + __entry->dpage_shift = dpage_shift; + ), + + TP_printk("dpage_shift %lu", __entry->dpage_shift) +); + +TRACE_EVENT(dmem_alloc_pages_node, + TP_PROTO(phys_addr_t addr, int node, int try_max, int result_nr), + TP_ARGS(addr, node, try_max, result_nr), + + TP_STRUCT__entry( + __field(phys_addr_t, addr) + __field(int, node) + __field(int, try_max) + __field(int, result_nr) + ), + + TP_fast_assign( + __entry->addr = addr; + __entry->node = node; + __entry->try_max = try_max; + __entry->result_nr = result_nr; + ), + + TP_printk("addr %#lx node %d try_max %d result_nr %d", + (unsigned long)__entry->addr, __entry->node, + __entry->try_max, __entry->result_nr) +); + +TRACE_EVENT(dmem_free_pages, + TP_PROTO(phys_addr_t addr, int dpages_nr), + TP_ARGS(addr, dpages_nr), + + TP_STRUCT__entry( + __field(phys_addr_t, addr) + __field(int, dpages_nr) + ), + + TP_fast_assign( + __entry->addr = addr; + __entry->dpages_nr = dpages_nr; + ), + + TP_printk("addr %#lx dpages_nr %d", (unsigned long)__entry->addr, + __entry->dpages_nr) +); +#endif + +/* This part must be outside protection */ +#include diff --git a/mm/dmem.c b/mm/dmem.c index a77a064c8d59..aa34bf20f830 100644 --- a/mm/dmem.c +++ b/mm/dmem.c @@ -18,6 +18,8 @@ #include #include +#define CREATE_TRACE_POINTS +#include /* * There are two kinds of page in dmem management: * - nature page, it's the CPU's page size, i.e, 4K on x86 @@ -559,6 +561,8 @@ int dmem_alloc_init(unsigned long dpage_shift) mutex_lock(&dmem_pool.lock); + trace_dmem_alloc_init(dpage_shift); + if (dmem_pool.dpage_shift) { /* * double init on the same page size is okay @@ -686,6 +690,7 @@ dmem_alloc_pages_from_nodelist(int *nodelist, nodemask_t *nodemask, } } + trace_dmem_alloc_pages_node(addr, node, try_max, *result_nr); mutex_unlock(&dmem_pool.lock); } return addr; @@ -791,6 +796,7 @@ void dmem_free_pages(phys_addr_t addr, unsigned int dpages_nr) mutex_lock(&dmem_pool.lock); + trace_dmem_free_pages(addr, dpages_nr); WARN_ON(!dmem_pool.dpage_shift); dregion = find_dmem_region(addr, &pdnode); From patchwork Thu Oct 8 07:53:58 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11822329 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id A183413B2 for ; Thu, 8 Oct 2020 07:54:26 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 7942F21897 for ; Thu, 8 Oct 2020 07:54:26 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="sND/U8OE" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728669AbgJHHyW (ORCPT ); Thu, 8 Oct 2020 03:54:22 -0400 
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:51964 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728510AbgJHHyV (ORCPT ); Thu, 8 Oct 2020 03:54:21 -0400 Received: from mail-pg1-x542.google.com (mail-pg1-x542.google.com [IPv6:2607:f8b0:4864:20::542]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 98F03C0613D2; Thu, 8 Oct 2020 00:54:20 -0700 (PDT) Received: by mail-pg1-x542.google.com with SMTP id r10so3571836pgb.10; Thu, 08 Oct 2020 00:54:20 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :in-reply-to:references; bh=+sV95vA2BDTcBvqCnGOEpgiyhXPcFlSLw1X7ElmDJgI=; b=sND/U8OEe6GU6RkIumK8RwtOKOvmzscP48r+GyA7ruj9/ehlYV2Wf8uuh+4YdmvBrP tKgMYCzILANziQBYJfA4FaB9t5pm7G3KdGGrqUPsCQWcFXyBpX/00Ftn27V+r8coIpCQ n6pYYalt7QUZcbKpzsYxuiulIOzfiBrS2L/CYJHFNDGULtfMCCsRURy6HMDvt7ng7jKc 9YRT4OJWgBWyaExro9EIGGXvYdS3KnOYF0Hw8MMI43/FP8eI0IGEqNJPnRBaVAouzxYF kjXQfQzVZ2lTpHHmLOOtgU0Q+zj+tXyfLk5iPm0CRvLNxDyYObv9tWBH33WTp1FANTTf ws8w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:in-reply-to:references; bh=+sV95vA2BDTcBvqCnGOEpgiyhXPcFlSLw1X7ElmDJgI=; b=V5Apqs1atWmnZ314dZTvcwniK+tZSrz1SRmEDq9SusQpKmm8gmne6uSVMq/SwpZZmX jH456op8UIBExGDY6VXe/Q/Lv+oyaR++5tGhim+Ynd78urxf+3ZF+3sqwePLg9s5Wh7i yRcfamf9kp21A5rHPbHaDbl7y+ORWzbuXL8HYJMoQLa76m8ChgNG3K/9/VsEV/LNxJrt O6wzMo/PvuVi0u1m7AsuX8aGdP6x6PHKrr+0fXWMwyr5tugRBomUyltIbTlOpUL02D3j yDgeWJGUyF3CtbEEtor9cM5DIks8E+JcNNT+kDJ1hvKtvr+st8ZaNGm3qaWY4NP2VBt0 1bRw== X-Gm-Message-State: AOAM532qDrSI7tSBmF3qPJlBLE9iTxuWYjyrL3d3aiwrS5fstU6b0vMZ 3EtROjLEVb5BgNr7bPgYA0U= X-Google-Smtp-Source: ABdhPJwE2e7CNavbiBISV2NjA6SfUvI+81Hi54CJJ1evij31Sib3fuZDPmy/ahgLBeekOnZBrmHqqA== X-Received: by 2002:a63:5d08:: with SMTP id r8mr6229132pgb.174.1602143659882; Thu, 08 Oct 2020 00:54:19 -0700 (PDT) Received: from localhost.localdomain ([203.205.141.61]) by smtp.gmail.com with ESMTPSA id k206sm6777106pfd.126.2020.10.08.00.54.16 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 08 Oct 2020 00:54:19 -0700 (PDT) From: yulei.kernel@gmail.com X-Google-Original-From: yuleixzhang@tencent.com To: akpm@linux-foundation.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com Cc: linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang , Xiao Guangrong Subject: [PATCH 08/35] dmem: show some statistic in debugfs Date: Thu, 8 Oct 2020 15:53:58 +0800 Message-Id: X-Mailer: git-send-email 2.17.1 In-Reply-To: References: In-Reply-To: References: Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: Yulei Zhang Create 'dmem' directory under debugfs and show some statistic for dmem pool, track total and free dpages on dmem pool and each numa node. Signed-off-by: Xiao Guangrong Signed-off-by: Yulei Zhang --- mm/Kconfig | 9 +++++ mm/dmem.c | 100 ++++++++++++++++++++++++++++++++++++++++++++++++++++- 2 files changed, 108 insertions(+), 1 deletion(-) diff --git a/mm/Kconfig b/mm/Kconfig index e1995da11cea..8a67c8933a42 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -235,6 +235,15 @@ config DMEM Allow reservation of memory which could be dedicated usage of dmem. It's the basics of dmemfs. 
+config DMEM_DEBUG_FS + bool "Enable debug information for direct memory" + depends on DMEM && DEBUG_FS + def_bool n + help + This option enables showing various statistics of direct memory + in debugfs filesystem. + +# # support for memory compaction config COMPACTION bool "Allow for memory compaction" diff --git a/mm/dmem.c b/mm/dmem.c index aa34bf20f830..6992e57d5df0 100644 --- a/mm/dmem.c +++ b/mm/dmem.c @@ -164,6 +164,103 @@ int dmem_region_register(int node, phys_addr_t start, phys_addr_t end) return 0; } +#ifdef CONFIG_DMEM_DEBUG_FS +struct debugfs_entry { + const char *name; + unsigned long offset; +}; + +#define DMEM_POOL_OFFSET(x) offsetof(struct dmem_pool, x) +#define DMEM_POOL_ENTRY(x) {__stringify(x), DMEM_POOL_OFFSET(x)} + +#define DMEM_NODE_OFFSET(x) offsetof(struct dmem_node, x) +#define DMEM_NODE_ENTRY(x) {__stringify(x), DMEM_NODE_OFFSET(x)} + +static struct debugfs_entry dmem_pool_entries[] = { + DMEM_POOL_ENTRY(region_num), + DMEM_POOL_ENTRY(registered_pages), + DMEM_POOL_ENTRY(unaligned_pages), + DMEM_POOL_ENTRY(dpage_shift), + DMEM_POOL_ENTRY(total_dpages), + DMEM_POOL_ENTRY(free_dpages), +}; + +static struct debugfs_entry dmem_node_entries[] = { + DMEM_NODE_ENTRY(total_dpages), + DMEM_NODE_ENTRY(free_dpages), +}; + +static int dmem_entry_get(void *offset, u64 *val) +{ + *val = *(u64 *)offset; + return 0; +} + +DEFINE_SIMPLE_ATTRIBUTE(dmem_fops, dmem_entry_get, NULL, "%llu\n"); + +static int dmemfs_init_debugfs_node(struct dmem_node *dnode, + struct dentry *parent) +{ + struct dentry *node_dir; + char dir_name[32]; + int i, ret = -EEXIST; + + snprintf(dir_name, sizeof(dir_name), "node%ld", + dnode - dmem_pool.nodes); + node_dir = debugfs_create_dir(dir_name, parent); + if (!node_dir) + return ret; + + for (i = 0; i < ARRAY_SIZE(dmem_node_entries); i++) + if (!debugfs_create_file(dmem_node_entries[i].name, 0444, + node_dir, (void *)dnode + dmem_node_entries[i].offset, + &dmem_fops)) + return ret; + return 0; +} + +static int dmemfs_init_debugfs(void) +{ + struct dentry *dmem_debugfs_dir; + struct dmem_node *dnode; + int i, ret = -EEXIST; + + dmem_debugfs_dir = debugfs_create_dir("dmem", NULL); + if (!dmem_debugfs_dir) + return ret; + + for (i = 0; i < ARRAY_SIZE(dmem_pool_entries); i++) + if (!debugfs_create_file(dmem_pool_entries[i].name, 0444, + dmem_debugfs_dir, + (void *)&dmem_pool + dmem_pool_entries[i].offset, + &dmem_fops)) + goto exit; + + for_each_dmem_node(dnode) { + /* + * do not create debugfs files for the node + * where no memory is available + */ + if (list_empty(&dnode->regions)) + continue; + + if (dmemfs_init_debugfs_node(dnode, dmem_debugfs_dir)) + goto exit; + } + + return 0; +exit: + debugfs_remove_recursive(dmem_debugfs_dir); + return ret; +} + +#else +static int dmemfs_init_debugfs(void) +{ + return 0; +} +#endif + #define PENALTY_FOR_DMEM_SHARED_NODE (1) static int dmem_nodeload[MAX_NUMNODES] __initdata; @@ -364,7 +461,8 @@ static int __init dmem_late_init(void) goto exit; } } - return ret; + + return dmemfs_init_debugfs(); exit: dmem_uinit(); return ret; From patchwork Thu Oct 8 07:53:59 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11822349 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 48E0C13B2 for ; Thu, 8 Oct 2020 07:55:00 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org 
(Postfix) with ESMTP id 1F7B5206F4 for ; Thu, 8 Oct 2020 07:55:00 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="I8xdPHW0" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728800AbgJHHy5 (ORCPT ); Thu, 8 Oct 2020 03:54:57 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:51980 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728707AbgJHHy0 (ORCPT ); Thu, 8 Oct 2020 03:54:26 -0400 Received: from mail-pf1-x441.google.com (mail-pf1-x441.google.com [IPv6:2607:f8b0:4864:20::441]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 81DA5C0613D2; Thu, 8 Oct 2020 00:54:25 -0700 (PDT) Received: by mail-pf1-x441.google.com with SMTP id l126so3323057pfd.5; Thu, 08 Oct 2020 00:54:25 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :in-reply-to:references; bh=eGljQKzTgkdWVWcoLwMoNc4yYvwq/5X5+EymWM5SqAU=; b=I8xdPHW0M1mm8vSW+D4T0TcVydFjhpjDcTRGHXyutoitSuXG9jQZ3womXx8XzPNXe5 i7JWsMAtVjzrAe+V0760CUL994DssjAoNXzyERtrr0N8hRjOfVo1TKMOiPDBTCMTA+io K8us9lACQuoG6Db8x6Mbciu0CuWTQwrBzp1mr5E6zT78Gaj5eevYKqQKJ586UbkB4LVs PZVtNK10HGwiIbblDumpxjC/kdTEK0j/H13eZaU+OqRK82W5/PAHJLqJbBE+cUY+sKR2 WeRvO3IE6yTrBaQgWF3TaVZYjeXze1NKHkc3ClWRk064TvC0Awum5aVEMzw05ZwD67/T AyLQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:in-reply-to:references; bh=eGljQKzTgkdWVWcoLwMoNc4yYvwq/5X5+EymWM5SqAU=; b=MLHdr0qxL5M/chd+kwHqwMs9ARECpWtP8zRyhVReCfHPM4k4oFgLnprHDLkxiO9TkY 9GwNvSrl0gQnxB1HZI0stUI8bDdCVL2C1ZWS8A+2RbbMOb4fj/t7zzvf0EUwWPDNvS67 fkxg7YgQ3L997dcACmL35t28unQaKRyMd0Z9AzTWLylDgbtD9XEFELNVGY6tk+1hLpv+ hnTRKcGzDDW8T1VwgiLM/1nzkGnmoZHc026C7N9zdc7hf7p3KqXRPja6/JdFphy+ABGv tCZMYVMcknfqJVaW5Zuecnr9e2a5FuzUsqY+3LZ+WA9Zzy7ceYwyv66Ii2q78/w+TzZh 1Q5Q== X-Gm-Message-State: AOAM532VIB5yP7A5TBfy1GJyh8lEKywvoUqTL5DFZepWvx03QxH4qk56 vuXgdE+A5TAe57nvy8r/lA4= X-Google-Smtp-Source: ABdhPJyxrTr9y9iwa3fxdCr1CirZbeew7YbZv3lIdL7uc7GALHtp/3iFe7NB/PJv2EulUjVkgjDO9w== X-Received: by 2002:a17:90a:fd97:: with SMTP id cx23mr6644455pjb.3.1602143665104; Thu, 08 Oct 2020 00:54:25 -0700 (PDT) Received: from localhost.localdomain ([203.205.141.61]) by smtp.gmail.com with ESMTPSA id k206sm6777106pfd.126.2020.10.08.00.54.22 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 08 Oct 2020 00:54:24 -0700 (PDT) From: yulei.kernel@gmail.com X-Google-Original-From: yuleixzhang@tencent.com To: akpm@linux-foundation.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com Cc: linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang , Xiao Guangrong Subject: [PATCH 09/35] dmemfs: support remote access Date: Thu, 8 Oct 2020 15:53:59 +0800 Message-Id: <0b749ec1fab63b2d8ee2354f576579fe23917c26.1602093760.git.yuleixzhang@tencent.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: References: In-Reply-To: References: Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: Yulei Zhang It is required by ptrace_writedata and ptrace_readdata to access dmem memory remotely. 
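For instance, a hedged userspace sketch (the target pid and the remote address of the dmemfs mapping are assumptions) of peeking one word in another process; because the VMA is VM_PFNMAP, access_process_vm() falls back to the new ->access handler:

#include <errno.h>
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

/* read one word from @remote_addr in process @pid (both assumed known) */
static long peek_remote(pid_t pid, void *remote_addr)
{
	long word;

	if (ptrace(PTRACE_ATTACH, pid, NULL, NULL) == -1)
		return -1;
	waitpid(pid, NULL, 0);		/* wait until the target stops */

	errno = 0;
	word = ptrace(PTRACE_PEEKDATA, pid, remote_addr, NULL);
	if (word == -1 && errno)
		perror("PTRACE_PEEKDATA");

	ptrace(PTRACE_DETACH, pid, NULL, NULL);
	return word;
}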
The typical user is gdb, after this patch, gdb is able to read & write memory owned by the attached process Signed-off-by: Xiao Guangrong Signed-off-by: Yulei Zhang --- fs/dmemfs/inode.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 46 insertions(+) diff --git a/fs/dmemfs/inode.c b/fs/dmemfs/inode.c index 8b0516d98ee7..4dacbf7e6844 100644 --- a/fs/dmemfs/inode.c +++ b/fs/dmemfs/inode.c @@ -367,6 +367,51 @@ static void radix_put_entry(void) rcu_read_unlock(); } +static bool check_vma_access(struct vm_area_struct *vma, int write) +{ + vm_flags_t vm_flags = write ? VM_WRITE : VM_READ; + + return !!(vm_flags & vma->vm_flags); +} + +static int +dmemfs_access_dmem(struct vm_area_struct *vma, unsigned long addr, + void *buf, int len, int write) +{ + struct inode *inode = file_inode(vma->vm_file); + struct super_block *sb = inode->i_sb; + void *entry, *maddr; + int offset, pgoff; + + if (!check_vma_access(vma, write)) + return -EACCES; + + pgoff = linear_page_index(vma, addr); + if (pgoff > (MAX_LFS_FILESIZE >> PAGE_SHIFT)) + return -EFAULT; + + entry = radix_get_create_entry(vma, addr, inode, pgoff); + if (IS_ERR(entry)) + return PTR_ERR(entry); + + offset = addr & (sb->s_blocksize - 1); + addr = dmem_entry_to_addr(inode, entry); + + /* + * it is not beyond vma's region as the vma should be aligned + * to blocksize + */ + len = min(len, (int)(sb->s_blocksize - offset)); + maddr = __va(addr); + if (write) + memcpy(maddr + offset, buf, len); + else + memcpy(buf, maddr + offset, len); + radix_put_entry(); + + return len; +} + static vm_fault_t dmemfs_fault(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; @@ -403,6 +448,7 @@ static unsigned long dmemfs_pagesize(struct vm_area_struct *vma) static const struct vm_operations_struct dmemfs_vm_ops = { .fault = dmemfs_fault, .pagesize = dmemfs_pagesize, + .access = dmemfs_access_dmem, }; int dmemfs_file_mmap(struct file *file, struct vm_area_struct *vma) From patchwork Thu Oct 8 07:54:00 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11822345 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id A7141109B for ; Thu, 8 Oct 2020 07:54:55 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 7FCB321924 for ; Thu, 8 Oct 2020 07:54:55 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="VzGnhu9w" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728761AbgJHHye (ORCPT ); Thu, 8 Oct 2020 03:54:34 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52002 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728743AbgJHHyb (ORCPT ); Thu, 8 Oct 2020 03:54:31 -0400 Received: from mail-pg1-x543.google.com (mail-pg1-x543.google.com [IPv6:2607:f8b0:4864:20::543]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 21092C0613D6; Thu, 8 Oct 2020 00:54:31 -0700 (PDT) Received: by mail-pg1-x543.google.com with SMTP id r10so3572217pgb.10; Thu, 08 Oct 2020 00:54:31 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :in-reply-to:references; bh=+UEtOZ7eFEPFXRY1fIK0hRtr84aEIUrt41wmuOK+xgg=; 
b=VzGnhu9w1XiqyxHi9u3p9HeKoHnmocTiTSFclG3q3f+eOp/0u1KUfPdyNNLCptI99d BSmJkv5jSH/V5D6I6hYRuyqzL3vAV7xsYPeGngHm8mA5L3/9HJbfXQ4Xu3wwMD51hVh7 F2sPUz9+tY10aIyW6I/MhFspsw+f2+H8LCzQhkqekjSILXEe+2AVuHf2Fbt4x0RAVeSN ybAYBeyLl01pH6U8NdVhnMlK9gNkc1BXBp/b0jRHRzxZAP4gSix8BiPqpKjdAaMgK2Fv 63VY8FpqtCqCnwLBIBtm2gAoqvXn0eEo4B1gcfbPKm+2XP3zBGEIn1pVjmdxQbYDboic V+lw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:in-reply-to:references; bh=+UEtOZ7eFEPFXRY1fIK0hRtr84aEIUrt41wmuOK+xgg=; b=dBaFKDi6IoiFMVz2ttS5CD0qwY4+LHFi7Za6iTh6V2DMzwMN1EDgN/5pCPkWkWeGf4 SWpguc+Pd3OUwIbZ3J+2NfxHJrs8pvfC5hWYL2aKt48/ZfPQ3+OyP+lQHiXaddPKtpR8 f4aElclsh+/8YCZlFjCc+rk0FiFrTMo5YSXYsH8z+w2YnoPeWZR9tLSQsLV0kdX/Dcyr zQAGa6bDSZNhUcr1CddRtvnBudwtlRK272/hxoxdT5sAgHd5/4zMMNpkgE+VYUyxjFhf GFEisf5A6FeXgYWtTVlt07XkyOR9nbrOO1z6QGfItNOBwc3Y10V9ar1IvUfeoOqMtdol RlnA== X-Gm-Message-State: AOAM531bYDqkMfOyo33I8FqgBEGa35tBF3DkowqimxwXCd78d39+3uul sRe3CBk+Z+kJkQRQJClf5/M= X-Google-Smtp-Source: ABdhPJzRNvFW9fEHPvYm2m5gg7LeIGTXsZtW9seA+qnH3C9IgNw3vgr0VPXAR6sFRlgs22r7ExeX6A== X-Received: by 2002:aa7:8e54:0:b029:142:2501:34d2 with SMTP id d20-20020aa78e540000b0290142250134d2mr6276079pfr.43.1602143669573; Thu, 08 Oct 2020 00:54:29 -0700 (PDT) Received: from localhost.localdomain ([203.205.141.61]) by smtp.gmail.com with ESMTPSA id k206sm6777106pfd.126.2020.10.08.00.54.26 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 08 Oct 2020 00:54:29 -0700 (PDT) From: yulei.kernel@gmail.com X-Google-Original-From: yuleixzhang@tencent.com To: akpm@linux-foundation.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com Cc: linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang , Xiao Guangrong Subject: [PATCH 10/35] dmemfs: introduce max_alloc_try_dpages parameter Date: Thu, 8 Oct 2020 15:54:00 +0800 Message-Id: X-Mailer: git-send-email 2.17.1 In-Reply-To: References: In-Reply-To: References: Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: Yulei Zhang It specifies the dmem page number allocated at one time, then multiple radix entries can be created. That will relief the allocation pressure and make page fault more fast. However that could cause no dmem page mmapped to userspace even if there are some free dmem pages. Set it to 1 to completely disable this behavior. 
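As a hedged usage sketch (the mount path, file name and size below are assumptions, and the length should be a multiple of the dmemfs pagesize mount option), this is the kind of workload where the batching shows up: each fault in the memset() may now populate up to max_alloc_try_dpages dpages at once.

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* map a file on an (assumed) dmemfs mount and fault in the whole range */
static int touch_dmemfs_file(const char *path, size_t len)
{
	int fd = open(path, O_RDWR | O_CREAT, 0600);
	void *p;

	if (fd < 0)
		return -1;
	/* set the file size first; it must be dmemfs-pagesize aligned */
	if (ftruncate(fd, len) < 0)
		goto err;
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		goto err;
	memset(p, 0, len);	/* page faults allocate dpages as described above */
	munmap(p, len);
	close(fd);
	return 0;
err:
	close(fd);
	return -1;
}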
Signed-off-by: Xiao Guangrong Signed-off-by: Yulei Zhang --- fs/dmemfs/inode.c | 41 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 41 insertions(+) diff --git a/fs/dmemfs/inode.c b/fs/dmemfs/inode.c index 4dacbf7e6844..6932d73edab6 100644 --- a/fs/dmemfs/inode.c +++ b/fs/dmemfs/inode.c @@ -34,6 +34,8 @@ MODULE_LICENSE("GPL v2"); #define CREATE_TRACE_POINTS #include "trace.h" +static uint __read_mostly max_alloc_try_dpages = 1; + struct dmemfs_mount_opts { unsigned long dpage_size; }; @@ -46,6 +48,44 @@ enum dmemfs_param { Opt_dpagesize, }; +static int +max_alloc_try_dpages_set(const char *val, const struct kernel_param *kp) +{ + uint sval; + int ret; + + ret = kstrtouint(val, 0, &sval); + if (ret) + return ret; + + /* should be 1 at least */ + if (!sval) + return -EINVAL; + + max_alloc_try_dpages = sval; + return 0; +} + +static struct kernel_param_ops alloc_max_try_dpages_ops = { + .set = max_alloc_try_dpages_set, + .get = param_get_uint, +}; + +/* + * it specifies the dmem page number allocated at one time, then + * multiple radix entries can be created. That will relief the + * allocation pressure and make page fault more fast. + * + * however that could cause no dmem page mmapped to userspace + * even if there are some free dmem pages + * + * set it to 1 to completely disable this behavior + */ +fs_param_cb(max_alloc_try_dpages, &alloc_max_try_dpages_ops, + &max_alloc_try_dpages, 0644); +__MODULE_PARM_TYPE(max_alloc_try_dpages, "uint"); +MODULE_PARM_DESC(max_alloc_try_dpages, "Set the dmem page number allocated at one time, should be 1 at least"); + const struct fs_parameter_spec dmemfs_fs_parameters[] = { fsparam_string("pagesize", Opt_dpagesize), {} @@ -317,6 +357,7 @@ radix_get_create_entry(struct vm_area_struct *vma, unsigned long fault_addr, } rcu_read_unlock(); + try_dpages = min(try_dpages, max_alloc_try_dpages); /* entry does not exist, create it */ addr = dmem_alloc_pages_vma(vma, fault_addr, try_dpages, &dpages); if (!addr) { From patchwork Thu Oct 8 07:54:01 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11822333 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id DE34313B2 for ; Thu, 8 Oct 2020 07:54:38 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id B3A4321D41 for ; Thu, 8 Oct 2020 07:54:38 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="u7N268Ib" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728766AbgJHHyg (ORCPT ); Thu, 8 Oct 2020 03:54:36 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52018 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728789AbgJHHyf (ORCPT ); Thu, 8 Oct 2020 03:54:35 -0400 Received: from mail-pg1-x542.google.com (mail-pg1-x542.google.com [IPv6:2607:f8b0:4864:20::542]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 46B76C061755; Thu, 8 Oct 2020 00:54:35 -0700 (PDT) Received: by mail-pg1-x542.google.com with SMTP id x16so3598881pgj.3; Thu, 08 Oct 2020 00:54:35 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :in-reply-to:references; bh=1mPWQ3y+ekTW2WnE7lidu19L3jZpv/s0br4tyldlnIk=; 
b=u7N268Ibc1GyZOWe2Vp3htaWdl1D35g97xP1MMM5NTjbKU+P8Jzh+igerBc40RQKgh 6zkjjA313LKMHX0tL52mWjT2utbTjNyHSK17UImwVFJb7ufJEEyT11XKvdFfJ54UXAR6 1rgnTcke5IKs0jdm2kmuxF/B3qulSe+tEXX8bipSjOapveRZB0AADfMsjMNJ7XY6K35K /qjCs0/CdusCucNvXudHB+r9shZxIBqz+imd9W8a7aiO44Aa9rt2+JOKauTFc2Zjc8gL PSRVbVd3V4LwM4sQ+ws8krJvdc1h4FpHxXFjDs+PVzYf8x0hEokWQTNfCKJaH/62Bmys ZEKQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:in-reply-to:references; bh=1mPWQ3y+ekTW2WnE7lidu19L3jZpv/s0br4tyldlnIk=; b=RH7nQQt+ycO91nxByu3JAuYWVC2B6wt7IloHTQKfkAWN/vkgmL5mszlwFy8m9a/HIO PdUgMQE9XMeBVhwUkrxH0sB6+Eq8gnF0e+zTlVue05K7K0W5g351N0zSxQDkNcDoeEKP q36SVTyIQ5koDIL3IdE2kvL7Kv6iX1GR0Ee06inYMhVZf4EQQWEN0QVpKuO4B6Fw3u9C YKK6jkz5RyNbHUyrcyyDUDTADQjCREey7APCBfh8Y0z4q0tkR7gyQEFEioecjY41/90M 0kfpW2Qf/hp2/wrw6I3prFgasc4ycKl2AWn36GqIrGk9M8nhK2XiDTZ6KCyInLQIvobH rlNg== X-Gm-Message-State: AOAM533Yyq7bzmXn0JSEZXSsavX3ih3I3YVsGvhgoN5BDPt1V2vRdJYW uAWHYtcg1bcLO4ASpt46oPc= X-Google-Smtp-Source: ABdhPJxvPPJlPTIc7/Sww4pvhKsLUACWnjRw8fA95iAdwKiT8jaYpm8/ZJDRZmCpaRDJhlfUCx9lNg== X-Received: by 2002:a17:90a:e64c:: with SMTP id ep12mr6750661pjb.43.1602143674884; Thu, 08 Oct 2020 00:54:34 -0700 (PDT) Received: from localhost.localdomain ([203.205.141.61]) by smtp.gmail.com with ESMTPSA id k206sm6777106pfd.126.2020.10.08.00.54.32 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 08 Oct 2020 00:54:34 -0700 (PDT) From: yulei.kernel@gmail.com X-Google-Original-From: yuleixzhang@tencent.com To: akpm@linux-foundation.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com Cc: linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang Subject: [PATCH 11/35] mm: export mempolicy interfaces to serve dmem allocator Date: Thu, 8 Oct 2020 15:54:01 +0800 Message-Id: X-Mailer: git-send-email 2.17.1 In-Reply-To: References: In-Reply-To: References: Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: Yulei Zhang Export interface interleave_nid() to serve dmem allocator. Signed-off-by: Yulei Zhang --- include/linux/mempolicy.h | 3 +++ mm/mempolicy.c | 4 ++-- 2 files changed, 5 insertions(+), 2 deletions(-) diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h index 5f1c74df264d..478966133514 100644 --- a/include/linux/mempolicy.h +++ b/include/linux/mempolicy.h @@ -139,6 +139,9 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp, struct mempolicy *get_task_policy(struct task_struct *p); struct mempolicy *__get_vma_policy(struct vm_area_struct *vma, unsigned long addr); +struct mempolicy *get_vma_policy(struct vm_area_struct *vma, unsigned long addr); +unsigned interleave_nid(struct mempolicy *pol, struct vm_area_struct *vma, + unsigned long addr, int shift); bool vma_policy_mof(struct vm_area_struct *vma); extern void numa_default_policy(void); diff --git a/mm/mempolicy.c b/mm/mempolicy.c index eddbe4e56c73..b3103f5d9123 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1816,7 +1816,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma, * freeing by another task. It is the caller's responsibility to free the * extra reference for shared policies. 
*/ -static struct mempolicy *get_vma_policy(struct vm_area_struct *vma, +struct mempolicy *get_vma_policy(struct vm_area_struct *vma, unsigned long addr) { struct mempolicy *pol = __get_vma_policy(vma, addr); @@ -1982,7 +1982,7 @@ static unsigned offset_il_node(struct mempolicy *pol, unsigned long n) } /* Determine a node number for interleave */ -static inline unsigned interleave_nid(struct mempolicy *pol, +unsigned interleave_nid(struct mempolicy *pol, struct vm_area_struct *vma, unsigned long addr, int shift) { if (vma) { From patchwork Thu Oct 8 07:54:02 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11822341 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 232D013B2 for ; Thu, 8 Oct 2020 07:54:52 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id E33A02193E for ; Thu, 8 Oct 2020 07:54:51 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="l8Ra7Eks" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728769AbgJHHyo (ORCPT ); Thu, 8 Oct 2020 03:54:44 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52034 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728741AbgJHHyk (ORCPT ); Thu, 8 Oct 2020 03:54:40 -0400 Received: from mail-pl1-x643.google.com (mail-pl1-x643.google.com [IPv6:2607:f8b0:4864:20::643]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 5E574C0613D2; Thu, 8 Oct 2020 00:54:40 -0700 (PDT) Received: by mail-pl1-x643.google.com with SMTP id y20so2350798pll.12; Thu, 08 Oct 2020 00:54:40 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :in-reply-to:references; bh=RcgMyRpYCMFPKKfeQGBLX4f9whd1oMvUnXei2mI75rM=; b=l8Ra7EksFBkM7aXJFPrA2eB3Xq6pYyiOaYXhMElfS3EIROnXctXddcVbTpovgAd9GT JXVolSyYuXt7B85Zx26ehB6xnboOlS+fMVLj3sz+vnFdiTJcz896cYCldVaJVNWQ7X1M NC/53HhkjRP00ZeAw3CNfU6HUxuemiqjcLxXynmIa8HNsMgaOaSGxCgNYGNmxluT1P/q sz3olzOcFf0tTibaHXgbGIxy3qYF/5fKbGiuIz3TSyLdNlh1eDCXgf1kqGYm5LezKRFE VeZ3haQMKU1mxBsxZCJvqoNVa75YuSXMi5uZTDU1+NF7mAr6WMvjG5zZvX5mAQOlndw4 5uAA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:in-reply-to:references; bh=RcgMyRpYCMFPKKfeQGBLX4f9whd1oMvUnXei2mI75rM=; b=g9kMlp04OfUER+TGVyCzwHjg6iyEDSBhY0jZcC4Y+CrDTjlBm45fJJ3kgX3tilnAlI BpUYfgkNuI1O8DaFAfNvsEy6CjL572AgbgziB3SzQPLgYgrnU0VMgk8OJctY0pj1Jfi8 FVXyEbhtPNnS+yi8nWm2++CGN1Vx5HcUwt1mi5FjozoKrtUfKfHuG9aRB0EN/pa5N1AJ kb8gley00OUWdf7NJO7LZnhBiuN1yPcvlgI7aL09b/D7hQm7JBWrXNYuXIKTwGLv9ohB JVjCYP07v2mGoYR5wKkGhcEmtwxLmZx7/mRRC9PMURTDvYMSlX04HhnJVWKjeVRgiiy/ w4zw== X-Gm-Message-State: AOAM530mYhdtSm1D/F3srPcj16f2EjIAQwZ/mKx+Nc1t9B36aNiD4juC wAerGJEOmg2ATXHCmqQiS74= X-Google-Smtp-Source: ABdhPJywCdHXkBPuQ18NYaJDK8Gmg6c0dxeop4xWSXCznUS07DUepLINsWQt1Z3SQMFam5b8lu+S5w== X-Received: by 2002:a17:902:d888:b029:d0:cb2d:f274 with SMTP id b8-20020a170902d888b02900d0cb2df274mr6276533plz.13.1602143679895; Thu, 08 Oct 2020 00:54:39 -0700 (PDT) Received: from localhost.localdomain ([203.205.141.61]) by smtp.gmail.com with ESMTPSA id k206sm6777106pfd.126.2020.10.08.00.54.36 
(version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 08 Oct 2020 00:54:39 -0700 (PDT) From: yulei.kernel@gmail.com X-Google-Original-From: yuleixzhang@tencent.com To: akpm@linux-foundation.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com Cc: linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang , Haiwei Li Subject: [PATCH 12/35] dmem: introduce mempolicy support Date: Thu, 8 Oct 2020 15:54:02 +0800 Message-Id: <1fce243b3bcd347c951a0991a6daf0645d441e4d.1602093760.git.yuleixzhang@tencent.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: References: In-Reply-To: References: Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: Yulei Zhang It adds mempolicy support for dmem to allocates memory from mempolicy specified nodes. Signed-off-by: Haiwei Li Signed-off-by: Yulei Zhang --- arch/x86/Kconfig | 1 + arch/x86/include/asm/pgtable.h | 7 ++++ arch/x86/include/asm/pgtable_types.h | 13 +++++- fs/dmemfs/Kconfig | 3 ++ include/linux/pgtable.h | 7 ++++ mm/Kconfig | 3 ++ mm/dmem.c | 63 +++++++++++++++++++++++++++- 7 files changed, 94 insertions(+), 3 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 7101ac64bb20..86f3139edfc7 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -73,6 +73,7 @@ config X86 select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE select ARCH_HAS_PMEM_API if X86_64 select ARCH_HAS_PTE_DEVMAP if X86_64 + select ARCH_HAS_PTE_DMEM if X86_64 select ARCH_HAS_PTE_SPECIAL select ARCH_HAS_UACCESS_FLUSHCACHE if X86_64 select ARCH_HAS_UACCESS_MCSAFE if X86_64 && X86_MCE diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index b836138ce852..ea4554a728bc 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -453,6 +453,13 @@ static inline pmd_t pmd_mkdevmap(pmd_t pmd) return pmd_set_flags(pmd, _PAGE_DEVMAP); } +#ifdef CONFIG_ARCH_HAS_PTE_DMEM +static inline pmd_t pmd_mkdmem(pmd_t pmd) +{ + return pmd_set_flags(pmd, _PAGE_SPECIAL | _PAGE_DMEM); +} +#endif + static inline pmd_t pmd_mkhuge(pmd_t pmd) { return pmd_set_flags(pmd, _PAGE_PSE); diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h index 816b31c68550..ee4cae110f5c 100644 --- a/arch/x86/include/asm/pgtable_types.h +++ b/arch/x86/include/asm/pgtable_types.h @@ -23,6 +23,15 @@ #define _PAGE_BIT_SOFTW2 10 /* " */ #define _PAGE_BIT_SOFTW3 11 /* " */ #define _PAGE_BIT_PAT_LARGE 12 /* On 2MB or 1GB pages */ +#define _PAGE_BIT_DMEM 57 /* Flag used to indicate dmem pmd. + * Since _PAGE_BIT_SPECIAL is defined + * same as _PAGE_BIT_CPA_TEST, we can + * not only use _PAGE_BIT_SPECIAL, so + * add _PAGE_BIT_DMEM to help + * indicate it. Since dmem pte will + * never be splitting, setting + * _PAGE_BIT_SPECIAL for pte is enough. 
+ */ #define _PAGE_BIT_SOFTW4 58 /* available for programmer */ #define _PAGE_BIT_PKEY_BIT0 59 /* Protection Keys, bit 1/4 */ #define _PAGE_BIT_PKEY_BIT1 60 /* Protection Keys, bit 2/4 */ @@ -112,9 +121,11 @@ #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE) #define _PAGE_NX (_AT(pteval_t, 1) << _PAGE_BIT_NX) #define _PAGE_DEVMAP (_AT(u64, 1) << _PAGE_BIT_DEVMAP) +#define _PAGE_DMEM (_AT(u64, 1) << _PAGE_BIT_DMEM) #else #define _PAGE_NX (_AT(pteval_t, 0)) #define _PAGE_DEVMAP (_AT(pteval_t, 0)) +#define _PAGE_DMEM (_AT(pteval_t, 0)) #endif #define _PAGE_PROTNONE (_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE) @@ -128,7 +139,7 @@ #define _PAGE_CHG_MASK (PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT | \ _PAGE_SPECIAL | _PAGE_ACCESSED | _PAGE_DIRTY | \ _PAGE_SOFT_DIRTY | _PAGE_DEVMAP | _PAGE_ENC | \ - _PAGE_UFFD_WP) + _PAGE_UFFD_WP | _PAGE_DMEM) #define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE) /* diff --git a/fs/dmemfs/Kconfig b/fs/dmemfs/Kconfig index d2894a513de0..19ca3914da39 100644 --- a/fs/dmemfs/Kconfig +++ b/fs/dmemfs/Kconfig @@ -1,5 +1,8 @@ config DMEM_FS tristate "Direct Memory filesystem support" + depends on DMEM + depends on TRANSPARENT_HUGEPAGE + depends on ARCH_HAS_PTE_DMEM help dmemfs (Direct Memory filesystem) is device memory or reserved memory based filesystem. This kind of memory is special as it diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index e8cbc2e795d5..45d4c4a3e519 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -1129,6 +1129,13 @@ static inline int pud_trans_unstable(pud_t *pud) #endif } +#ifndef CONFIG_ARCH_HAS_PTE_DMEM +static inline pmd_t pmd_mkdmem(pmd_t pmd) +{ + return pmd; +} +#endif + #ifndef pmd_read_atomic static inline pmd_t pmd_read_atomic(pmd_t *pmdp) { diff --git a/mm/Kconfig b/mm/Kconfig index 8a67c8933a42..09d1b1551a44 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -795,6 +795,9 @@ config IDLE_PAGE_TRACKING config ARCH_HAS_PTE_DEVMAP bool +config ARCH_HAS_PTE_DMEM + bool + config ZONE_DEVICE bool "Device memory (pmem, HMM, etc...) 
hotplug support" depends on MEMORY_HOTPLUG diff --git a/mm/dmem.c b/mm/dmem.c index 6992e57d5df0..2e61dbddbc62 100644 --- a/mm/dmem.c +++ b/mm/dmem.c @@ -822,6 +822,56 @@ dmem_alloc_pages_nodemask(int nid, nodemask_t *nodemask, unsigned int try_max, } EXPORT_SYMBOL(dmem_alloc_pages_nodemask); +/* Return a nodelist indicated for current node representing a mempolicy */ +static int *policy_nodelist(struct mempolicy *policy) +{ + int nd = numa_node_id(); + + switch (policy->mode) { + case MPOL_PREFERRED: + if (!(policy->flags & MPOL_F_LOCAL)) + nd = policy->v.preferred_node; + break; + case MPOL_BIND: + if (unlikely(!node_isset(nd, policy->v.nodes))) + nd = first_node(policy->v.nodes); + break; + default: + WARN_ON(1); + } + return dmem_nodelist(nd); +} + +static nodemask_t *dmem_policy_nodemask(struct mempolicy *policy) +{ + if (unlikely(policy->mode == MPOL_BIND) && + cpuset_nodemask_valid_mems_allowed(&policy->v.nodes)) + return &policy->v.nodes; + + return NULL; +} + +static void +get_mempolicy_nlist_and_nmask(struct mempolicy *pol, + struct vm_area_struct *vma, unsigned long addr, + int **nl, nodemask_t **nmask) +{ + if (pol->mode == MPOL_INTERLEAVE) { + unsigned int nid; + + /* + * we use dpage_shift to interleave numa nodes although + * multiple dpages may be allocated + */ + nid = interleave_nid(pol, vma, addr, dmem_pool.dpage_shift); + *nl = dmem_nodelist(nid); + *nmask = NULL; + } else { + *nl = policy_nodelist(pol); + *nmask = dmem_policy_nodemask(pol); + } +} + /* * dmem_alloc_pages_vma - Allocate pages for a VMA. * @@ -830,6 +880,9 @@ EXPORT_SYMBOL(dmem_alloc_pages_nodemask); * @try_max: try to allocate @try_max dpages if possible * @result_nr: allocated dpage number returned to the caller * + * This function allocates pages from dmem pool and applies a NUMA policy + * associated with the VMA. + * * Return the physical address of the first dpage allocated from dmem * pool, or 0 on failure. 
The allocated dpage number is filled into * @result_nr @@ -839,13 +892,19 @@ dmem_alloc_pages_vma(struct vm_area_struct *vma, unsigned long addr, unsigned int try_max, unsigned int *result_nr) { phys_addr_t phys_addr; + struct mempolicy *pol; int *nl; + nodemask_t *nmask; unsigned int cpuset_mems_cookie; retry_cpuset: - nl = dmem_nodelist(numa_node_id()); + pol = get_vma_policy(vma, addr); + cpuset_mems_cookie = read_mems_allowed_begin(); + + get_mempolicy_nlist_and_nmask(pol, vma, addr, &nl, &nmask); + mpol_cond_put(pol); - phys_addr = dmem_alloc_pages_from_nodelist(nl, NULL, try_max, + phys_addr = dmem_alloc_pages_from_nodelist(nl, nmask, try_max, result_nr); if (unlikely(!phys_addr && read_mems_allowed_retry(cpuset_mems_cookie))) goto retry_cpuset; From patchwork Thu Oct 8 07:54:03 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11822337 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 2C783109B for ; Thu, 8 Oct 2020 07:54:49 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id F090B21D24 for ; Thu, 8 Oct 2020 07:54:48 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="ecFJwkfj" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728760AbgJHHyp (ORCPT ); Thu, 8 Oct 2020 03:54:45 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52050 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728741AbgJHHyp (ORCPT ); Thu, 8 Oct 2020 03:54:45 -0400 Received: from mail-pl1-x644.google.com (mail-pl1-x644.google.com [IPv6:2607:f8b0:4864:20::644]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 957AFC0613D2; Thu, 8 Oct 2020 00:54:44 -0700 (PDT) Received: by mail-pl1-x644.google.com with SMTP id x5so2372554plo.6; Thu, 08 Oct 2020 00:54:44 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :in-reply-to:references; bh=j2YZZCvtQrFkPVWDuwpwq7+KzOOQJEvUrhZvKk1fjMk=; b=ecFJwkfjeUhG6Ewb8+CxFWrP9cOyAEta2DZv8zhYOyGy3Q0qE4Mpwd6+BobicUOsmj 1wM+GyB01Xqf0tLzmC7m18rtkgstQQQsXIYh19vScyOTtPfnkzmvkRWBC+8jd5aDeaCx ntfkwO/fJ8k0UwhZOr2kYUmulX96oykExS1XJ8vxkyUtoyRajHf1UFP1Ir1VERT9/l+M ZZ8nNMimh6xTU/vZ6jf8rTfp1fLzzk7rLaVzhzygoThnTRU0vGG2IDednS2eoWuNeSxk XXogKMADQrgX6uSMTpJ7XaAZ81TximYoBC9VePUa2DQ8ZDnThdYUyFXg6eAeniix5U+C T8ow== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:in-reply-to:references; bh=j2YZZCvtQrFkPVWDuwpwq7+KzOOQJEvUrhZvKk1fjMk=; b=T2RRHhCPtWc5POSNt1ciIPtbjRgwUl2TFFcdq0nSr/vg39BPS6tGYpxUCFx9GH0CwM TbLpL5nP+zd8otyubMXgXnpWSQg8W7lHG+JJuiu/UCQ05cn7vZhTXChgZ9e7WvjDzgGT WCEyoyUZzxwWHT/guI2fRPZi3l/ceFihSpOyJ0HGwKnrWtgDYSu/viz7MR2OZjssiKK4 bHWku/ipJyILyDbvuTKMD5aWF3cz+AFoHwLiUGpJkhWOKnRe6fpyaAf3QqCU4DEDsYat jQkk4j0L5EwScHWpZl8uSS5uSW9nbXJr8s2qkv0Lt7GJGnb10yheYCPW24iSwH9gjuGj jMXg== X-Gm-Message-State: AOAM532wBgxxF3zUvks06qrQWzW3z7Of8wXNOqBzJzH+SAdQ/65t6C0V zNRuUBCgRJ0of02Q33lGado= X-Google-Smtp-Source: ABdhPJzREbTGVLK3lHXufvOTN8YIq/D7sgMqNQp77hhr15+N41wNgFbuZMgu0HCr6gW/c1hx19Y0AQ== X-Received: by 2002:a17:902:8d8f:b029:d0:cc02:8527 with SMTP id 
v15-20020a1709028d8fb02900d0cc028527mr6504505plo.33.1602143684225; Thu, 08 Oct 2020 00:54:44 -0700 (PDT) Received: from localhost.localdomain ([203.205.141.61]) by smtp.gmail.com with ESMTPSA id k206sm6777106pfd.126.2020.10.08.00.54.41 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 08 Oct 2020 00:54:43 -0700 (PDT) From: yulei.kernel@gmail.com X-Google-Original-From: yuleixzhang@tencent.com To: akpm@linux-foundation.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com Cc: linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang , Chen Zhuo Subject: [PATCH 13/35] mm, dmem: introduce PFN_DMEM and pfn_t_dmem Date: Thu, 8 Oct 2020 15:54:03 +0800 Message-Id: <8c193bcb9cfd7ccce174bc4bbc9c4f5239c1f5ed.1602093760.git.yuleixzhang@tencent.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: References: In-Reply-To: References: Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: Yulei Zhang Introduce PFN_DMEM as a new pfn flag for dmem pfn, define it by setting (BITS_PER_LONG_LONG - 6) bit. Introduce pfn_t_dmem() helper to recognize dmem pfn. Signed-off-by: Chen Zhuo Signed-off-by: Yulei Zhang --- include/linux/pfn_t.h | 17 ++++++++++++++++- 1 file changed, 16 insertions(+), 1 deletion(-) diff --git a/include/linux/pfn_t.h b/include/linux/pfn_t.h index 2d9148221e9a..c6c0f1f84498 100644 --- a/include/linux/pfn_t.h +++ b/include/linux/pfn_t.h @@ -11,6 +11,7 @@ * PFN_MAP - pfn has a dynamic page mapping established by a device driver * PFN_SPECIAL - for CONFIG_FS_DAX_LIMITED builds to allow XIP, but not * get_user_pages + * PFN_DMEM - pfn references a dmem page */ #define PFN_FLAGS_MASK (((u64) (~PAGE_MASK)) << (BITS_PER_LONG_LONG - PAGE_SHIFT)) #define PFN_SG_CHAIN (1ULL << (BITS_PER_LONG_LONG - 1)) @@ -18,13 +19,15 @@ #define PFN_DEV (1ULL << (BITS_PER_LONG_LONG - 3)) #define PFN_MAP (1ULL << (BITS_PER_LONG_LONG - 4)) #define PFN_SPECIAL (1ULL << (BITS_PER_LONG_LONG - 5)) +#define PFN_DMEM (1ULL << (BITS_PER_LONG_LONG - 6)) #define PFN_FLAGS_TRACE \ { PFN_SPECIAL, "SPECIAL" }, \ { PFN_SG_CHAIN, "SG_CHAIN" }, \ { PFN_SG_LAST, "SG_LAST" }, \ { PFN_DEV, "DEV" }, \ - { PFN_MAP, "MAP" } + { PFN_MAP, "MAP" }, \ + { PFN_DMEM, "DMEM" } static inline pfn_t __pfn_to_pfn_t(unsigned long pfn, u64 flags) { @@ -128,4 +131,16 @@ static inline bool pfn_t_special(pfn_t pfn) return false; } #endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */ + +#ifdef CONFIG_ARCH_HAS_PTE_DMEM +static inline bool pfn_t_dmem(pfn_t pfn) +{ + return (pfn.val & PFN_DMEM) == PFN_DMEM; +} +#else +static inline bool pfn_t_dmem(pfn_t pfn) +{ + return false; +} +#endif /* CONFIG_ARCH_HAS_PTE_DMEM */ #endif /* _LINUX_PFN_T_H_ */ From patchwork Thu Oct 8 07:54:04 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11822373 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 9A2BE109B for ; Thu, 8 Oct 2020 07:55:37 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 6CE0821924 for ; Thu, 8 Oct 2020 07:55:37 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="NYkyGdYY" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id 
S1728877AbgJHHzC (ORCPT ); Thu, 8 Oct 2020 03:55:02 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52070 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728838AbgJHHyt (ORCPT ); Thu, 8 Oct 2020 03:54:49 -0400 Received: from mail-pf1-x443.google.com (mail-pf1-x443.google.com [IPv6:2607:f8b0:4864:20::443]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 955BFC0613D3; Thu, 8 Oct 2020 00:54:49 -0700 (PDT) Received: by mail-pf1-x443.google.com with SMTP id x22so3296584pfo.12; Thu, 08 Oct 2020 00:54:49 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :in-reply-to:references; bh=IEHhXQUdY1ZNSJ6cw3hAQj1j8oPXbE4ltFHOuzAEm/k=; b=NYkyGdYYBu2sahdkP3l459Nd8euDPH28LupNgJ0OWQcTZQwjp/PhiM/4h1TJ5gx3gq b41NeP0TNtOBG5zqIdShwail8eMLCKNazBvRAPChNLIBefQyGiguYt0zP3kD3PGdXC5e gnUbY82y+GsG7YD/z91+bF/ZrBRE2twZmTBpIJzcUzaf4kG+9QCL3Ng8+SKLwLrknZ6b KMyAB901/+dVpdYMDm0f/UyxXCABztevhmhXbJeu41BjyzipwOvZWHmfHqVW9wkd+tCe 2BLv9OLJO4RF7vReMUHxtZjapseku4xEx6hI/gO7ruPwQlmZ5pgIwTRffBJ0ZIb9qYqz dq+Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:in-reply-to:references; bh=IEHhXQUdY1ZNSJ6cw3hAQj1j8oPXbE4ltFHOuzAEm/k=; b=hoeL3g2tTHGVW+qQ1suhWO4J7pzmV12k6cKCmsothBuVbKM3RTcBVv6wrqZhq/O0qT L/JSD7EbvBZtWMppp6CMdst1OiD5/RQbiZHUzzLC/qUJhmBfFIoHLo9U/dZJDANbj4JJ +mv/0m9KC3dFycXv9CaONx9cE9MOhZpRuyqSUceuuNEtiXP6AfCz4FpgAbsAMeJDG/z0 MQVHvSw0+WQjKfkEsXyulvkEKkYDpasrRF+NBiAhXhLBRALZq3z00fLrAEy/5J6uDrIB mKI0sKfZZGfU6s+zAjTBH4kPoJhlj8qKI83VqQ/fSsVgj410n8LaWjXt6BOjLet8jvq/ MDhQ== X-Gm-Message-State: AOAM5308i2ZUCEGvF2GluypuW4Xn6y90LioegmdYA/XVFwp19SdcC39w brBlzCmWZFWvcg4LDJsyWwc= X-Google-Smtp-Source: ABdhPJw4ya66Pmv8cUGWpsNxBGAxWYEcWj94HSID9RKL8qBVRY4KlAMcSwWGEk2h7qmlCK1fhYxgPg== X-Received: by 2002:a17:90a:cf8b:: with SMTP id i11mr7006668pju.181.1602143689231; Thu, 08 Oct 2020 00:54:49 -0700 (PDT) Received: from localhost.localdomain ([203.205.141.61]) by smtp.gmail.com with ESMTPSA id k206sm6777106pfd.126.2020.10.08.00.54.46 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 08 Oct 2020 00:54:48 -0700 (PDT) From: yulei.kernel@gmail.com X-Google-Original-From: yuleixzhang@tencent.com To: akpm@linux-foundation.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com Cc: linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang , Chen Zhuo Subject: [PATCH 14/35] mm, dmem: dmem-pmd vs thp-pmd Date: Thu, 8 Oct 2020 15:54:04 +0800 Message-Id: X-Mailer: git-send-email 2.17.1 In-Reply-To: References: In-Reply-To: References: Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: Yulei Zhang A dmem huge page is ultimately not a transparent huge page. As we decided to use pmd_special() to distinguish dmem-pmd from thp-pmd, we should make some slightly different semantics between pmd_special() and pmd_trans_huge(), just as pmd_devmap() in upstream. This distinction is especially important in some mm-core paths such as zap_pmd_range(). Explicitly mark the pmd_trans_huge() helpers that dmem needs by adding pmd_special() checks. This method could be reused in many mm-core paths. 
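As a rough illustration (a sketch only, not a hunk from this patch), the checks added by the following patches in those mm-core paths take this shape, with pmd_special() tested next to the existing pmd_trans_huge()/pmd_devmap() checks:

	/* sketch: recognize a dmem huge pmd alongside thp/devmap huge pmds */
	if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) ||
	    pmd_devmap(*pmd) || pmd_special(*pmd)) {
		/* stable huge entry: handle the whole pmd, do not walk ptes */
	}
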
Signed-off-by: Chen Zhuo Signed-off-by: Yulei Zhang --- arch/x86/include/asm/pgtable.h | 10 +++++++++- include/linux/pgtable.h | 5 +++++ 2 files changed, 14 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index ea4554a728bc..e29601cad384 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -260,7 +260,7 @@ static inline int pmd_large(pmd_t pte) /* NOTE: when predicate huge page, consider also pmd_devmap, or use pmd_large */ static inline int pmd_trans_huge(pmd_t pmd) { - return (pmd_val(pmd) & (_PAGE_PSE|_PAGE_DEVMAP)) == _PAGE_PSE; + return (pmd_val(pmd) & (_PAGE_PSE|_PAGE_DEVMAP|_PAGE_DMEM)) == _PAGE_PSE; } #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD @@ -276,6 +276,14 @@ static inline int has_transparent_hugepage(void) return boot_cpu_has(X86_FEATURE_PSE); } +#ifdef CONFIG_ARCH_HAS_PTE_DMEM +static inline int pmd_special(pmd_t pmd) +{ + return (pmd_val(pmd) & (_PAGE_SPECIAL | _PAGE_DMEM)) == + (_PAGE_SPECIAL | _PAGE_DMEM); +} +#endif + #ifdef CONFIG_ARCH_HAS_PTE_DEVMAP static inline int pmd_devmap(pmd_t pmd) { diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 45d4c4a3e519..1fe8546c0a7c 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -1134,6 +1134,11 @@ static inline pmd_t pmd_mkdmem(pmd_t pmd) { return pmd; } + +static inline int pmd_special(pmd_t pmd) +{ + return 0; +} #endif #ifndef pmd_read_atomic From patchwork Thu Oct 8 07:54:05 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11822429 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 894B6109B for ; Thu, 8 Oct 2020 07:57:03 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 5FAB321924 for ; Thu, 8 Oct 2020 07:57:03 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="XHWscIgw" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728945AbgJHHzk (ORCPT ); Thu, 8 Oct 2020 03:55:40 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52084 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728620AbgJHHyy (ORCPT ); Thu, 8 Oct 2020 03:54:54 -0400 Received: from mail-pg1-x541.google.com (mail-pg1-x541.google.com [IPv6:2607:f8b0:4864:20::541]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 062F9C061755; Thu, 8 Oct 2020 00:54:54 -0700 (PDT) Received: by mail-pg1-x541.google.com with SMTP id u24so3608934pgi.1; Thu, 08 Oct 2020 00:54:54 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :in-reply-to:references; bh=m3HbgxAAw1LnL8PrjK1o24gBOd5gcGd3+lSqErW1xWY=; b=XHWscIgwaSuZzQlbFX/OPJBp8f79yzfSB/JybRdJG16eDcNmJgjnc7ZrQE1TOUAQ6H 97UMoo89Pgr0CCEL2NwB7LirEgxtSZ2yY4JB2sgMQ+ePkMXLNCmP1cnlXdP+D1yvCQjO mzt1xCdBaNAHwbX0XphJhhH3UK/KpkYKHehQuN8tqm1w6XF47mb+Y6Ss/QbB2O6umMiI dt/KydXVuix3o+JdPgYvSWbdho49YGQd4B2/nVngj8fTEUFaaribHtQZL/VTBk4sA411 HdKlwjZ1X9x8329MGcOTUMuTUQOgl43/xOUvyRpcK/SZVrE181JZaWHIMmYkm2FKcIuL kBVQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to 
:references:in-reply-to:references; bh=m3HbgxAAw1LnL8PrjK1o24gBOd5gcGd3+lSqErW1xWY=; b=tQ3/9nLELhqkXY6SkU9exXkdfYml/k+hYQpjxuyXHZs3ETerHK2TlXxUDzetyCYHYZ 6dOdgAFxnMDokJHnxSKHRiIdOKHacgAO7o+9zLhqmqulVXNYrB+VmdQTAMs3tqE9AJ5U qF/yws2yE+vpZJhVi8hdmCubfDY0ixhdZokAs/WRinS6DrwWrPeUAjOm+fX3xantNewE 7ViAjELRwqwaQ9vY8x7DI/WqnN59DUraz1d6pQ+fF/TQpqAj0Xxg27/Q33zYkv3YmfLD L26L94HO4+Nw9w6+B9DK/cZEWNhiNFA4D2up5C6qJnQ80ScvVPxJiRYiG6TekY5F64Lk Wf5Q== X-Gm-Message-State: AOAM533/D+s95P1M/lWuaMa/TuaV239FLsAHcMxo+CZEMK/hvxqFPA8a X5xlwPCOr+myUWN3RXV6yek= X-Google-Smtp-Source: ABdhPJy4B1oPfec867V7HyuwKf/ugafGPnfGPMIKKLM3gAjTluU6/3ui2FYFXbSRkwOJFjbbiML0rw== X-Received: by 2002:a17:90a:aa8a:: with SMTP id l10mr2570886pjq.9.1602143693663; Thu, 08 Oct 2020 00:54:53 -0700 (PDT) Received: from localhost.localdomain ([203.205.141.61]) by smtp.gmail.com with ESMTPSA id k206sm6777106pfd.126.2020.10.08.00.54.50 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 08 Oct 2020 00:54:53 -0700 (PDT) From: yulei.kernel@gmail.com X-Google-Original-From: yuleixzhang@tencent.com To: akpm@linux-foundation.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com Cc: linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang , Chen Zhuo Subject: [PATCH 15/35] mm: add pmd_special() check for pmd_trans_huge_lock() Date: Thu, 8 Oct 2020 15:54:05 +0800 Message-Id: <22298178ceab26491201b51a17f09b2283d655e8.1602093760.git.yuleixzhang@tencent.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: References: In-Reply-To: References: Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: Yulei Zhang As dmem-pmd had been distinguished from thp-pmd, we need to add pmd_special() such that pmd_trans_huge_lock could fetch ptl for dmem huge pmd and treat it as stable pmd. 
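For context, a typical caller (a sketch of the usual pattern, not code taken from this series) relies on the returned ptl to know the huge pmd is stable:

	ptl = pmd_trans_huge_lock(pmd, vma);
	if (ptl) {
		/* pmd is a stable huge entry (thp, devmap or dmem) under ptl */
		spin_unlock(ptl);
	} else {
		/* fall back to the pte-level walk */
	}

Without the pmd_special() check a dmem huge pmd would take the pte path and be walked as if a page table page were present.
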
Signed-off-by: Chen Zhuo Signed-off-by: Yulei Zhang --- include/linux/huge_mm.h | 3 ++- mm/huge_memory.c | 2 +- 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 8a8bc46a2432..b7381e5aafe5 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -245,7 +245,8 @@ static inline int is_swap_pmd(pmd_t pmd) static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma) { - if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) + if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) + || pmd_devmap(*pmd) || pmd_special(*pmd)) return __pmd_trans_huge_lock(pmd, vma); else return NULL; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 7ff29cc3d55c..531493a0bc82 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1862,7 +1862,7 @@ spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma) spinlock_t *ptl; ptl = pmd_lock(vma->vm_mm, pmd); if (likely(is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || - pmd_devmap(*pmd))) + pmd_devmap(*pmd) || pmd_special(*pmd))) return ptl; spin_unlock(ptl); return NULL; From patchwork Thu Oct 8 07:54:06 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11822353 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id A7E31109B for ; Thu, 8 Oct 2020 07:55:11 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 7BA7221927 for ; Thu, 8 Oct 2020 07:55:11 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="s5am6itE" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728906AbgJHHzE (ORCPT ); Thu, 8 Oct 2020 03:55:04 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52102 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728851AbgJHHy7 (ORCPT ); Thu, 8 Oct 2020 03:54:59 -0400 Received: from mail-pg1-x542.google.com (mail-pg1-x542.google.com [IPv6:2607:f8b0:4864:20::542]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 5948BC0613D5; Thu, 8 Oct 2020 00:54:58 -0700 (PDT) Received: by mail-pg1-x542.google.com with SMTP id h6so3595154pgk.4; Thu, 08 Oct 2020 00:54:58 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :in-reply-to:references; bh=uRaztcCJnlOoX6i9scY25dALTRO/x6REIkOSQ8S7jV8=; b=s5am6itE5ChGCgbRnNVLHpQK3th4boGgvEyz/+NCiS/yVET+IQvmz+R7EGHO/mDQxu D6QKxsonLgPo5UaJ5jajc5nvXTMqkP9BsLSZ+G2OQtxFjEocWZc3ja5hXZZSByAASU9f T9P3CvPF7HkFcUaAHkROPIzZyu4ZASaPuuaPA+cY7/LUOrvd7OlgqxQGkLZXT6PcNUPI 6Bb9pcCgbv2ayWI+WAdFai8nslKRJSO79E7n36X70DvCPLAfLW9x9bnemZMh45jd3Qpj rQWRceAcpwlVyXqe9Bm42qUm44gv23i12iV1YgOtKTOlcOMvVF7/qobS4kSTeBSloJOX exIg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:in-reply-to:references; bh=uRaztcCJnlOoX6i9scY25dALTRO/x6REIkOSQ8S7jV8=; b=YqjIjE8JLG/awOH4Y4eayVD9OTp2Wr/kerXlMKYaDF/SVRXF6LdTW5AjHBxpKTXsIJ 1RXMUOY6i453X0NeEDF4cxFVo3cMIBcpu3rslL9rjdLOVFNSNdPhJqn2fYmWbiEkx6J3 8Kv19S6DuH+YtGyEcgYsedfqOmrGSUOmq/GXrGXhXWpaI6hcgpQGji6GHc7tCYS1felJ 
eF/j5r4Gyk4UoG2FxZweBZVTGbe3uAmD0PN5JMXARxQJweLvBMy6xIXXm9yShO6az9rx PKHZTcgZi8c45X0KXnmo3PeGl9dgJMAhIBNGvylPPNA+7IhYzDDQVqP48bsJPS+JFiNd FW0w== X-Gm-Message-State: AOAM531rFUKXoSjfl9nt3me3TopbF5RF1+Smc4YstiVjzhcX5ABuQiHH AsRfBwlrSua0nNQVgxKf1rU= X-Google-Smtp-Source: ABdhPJxmneky4Nkjo5WcAH3W3ovFYU42Ol4bsicbBn4KnNNC4bw5RMfh4unJXvihHs0zVvq7rxpVdQ== X-Received: by 2002:a62:93:0:b029:13e:d13d:a085 with SMTP id 141-20020a6200930000b029013ed13da085mr6187059pfa.28.1602143697958; Thu, 08 Oct 2020 00:54:57 -0700 (PDT) Received: from localhost.localdomain ([203.205.141.61]) by smtp.gmail.com with ESMTPSA id k206sm6777106pfd.126.2020.10.08.00.54.54 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 08 Oct 2020 00:54:57 -0700 (PDT) From: yulei.kernel@gmail.com X-Google-Original-From: yuleixzhang@tencent.com To: akpm@linux-foundation.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com Cc: linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang , Chen Zhuo Subject: [PATCH 16/35] dmemfs: introduce ->split() to dmemfs_vm_ops Date: Thu, 8 Oct 2020 15:54:06 +0800 Message-Id: X-Mailer: git-send-email 2.17.1 In-Reply-To: References: In-Reply-To: References: Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: Yulei Zhang It is required by __split_vma() to adjust vma. munmap() which create hole unaligned to pagesize in dmemfs-mapping should be forbidden. Signed-off-by: Chen Zhuo Signed-off-by: Yulei Zhang --- fs/dmemfs/inode.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/fs/dmemfs/inode.c b/fs/dmemfs/inode.c index 6932d73edab6..e37498c00497 100644 --- a/fs/dmemfs/inode.c +++ b/fs/dmemfs/inode.c @@ -453,6 +453,13 @@ dmemfs_access_dmem(struct vm_area_struct *vma, unsigned long addr, return len; } +static int dmemfs_split(struct vm_area_struct *vma, unsigned long addr) +{ + if (addr & (dmem_page_size(file_inode(vma->vm_file)) - 1)) + return -EINVAL; + return 0; +} + static vm_fault_t dmemfs_fault(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; @@ -487,6 +494,7 @@ static unsigned long dmemfs_pagesize(struct vm_area_struct *vma) } static const struct vm_operations_struct dmemfs_vm_ops = { + .split = dmemfs_split, .fault = dmemfs_fault, .pagesize = dmemfs_pagesize, .access = dmemfs_access_dmem, From patchwork Thu Oct 8 07:54:07 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11822369 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 9B3B713B2 for ; Thu, 8 Oct 2020 07:55:32 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 75F4121897 for ; Thu, 8 Oct 2020 07:55:32 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="hRkiLiH0" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728936AbgJHHzb (ORCPT ); Thu, 8 Oct 2020 03:55:31 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52118 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728871AbgJHHzC (ORCPT ); Thu, 8 Oct 2020 03:55:02 -0400 Received: from mail-pg1-x544.google.com (mail-pg1-x544.google.com [IPv6:2607:f8b0:4864:20::544]) by 
lindbergh.monkeyblade.net (Postfix) with ESMTPS id 7B6DEC061755; Thu, 8 Oct 2020 00:55:02 -0700 (PDT) Received: by mail-pg1-x544.google.com with SMTP id r21so228945pgj.5; Thu, 08 Oct 2020 00:55:02 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :in-reply-to:references; bh=kQ4dDkDLHOzlM4Ta9v469E78PD4zk8EspT9PQyARZLY=; b=hRkiLiH0+aLvUr+ptwCCYLTMzxyBKI3GYqDPeHrkc4hbWYiT6E/f2CqdioEdWxlKj+ 2XiFq/vp8j5rdWQq1bCva2BQQpCsf/mR9MXEyGcEdUoNqlWzSpJiOasatdzC9aTMVqHz WYIetmFhyOW3djjHcdFyntUg1yDkLhjuhonsw56YGUVEVwX/8duLgqSG23irVJe0XyzC N4fjzcLferZEl6fW/7nGIbuTgs4reXHcLDd7ev3xioemq987V0KlJv+lDJ47Sx/afilE JJ1aDvayWG8IeG0hedKEzoGZsVZQSiaQ3vkB95yyhMJFRWNEnRETtN7+hj8hada0icYr jZMg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:in-reply-to:references; bh=kQ4dDkDLHOzlM4Ta9v469E78PD4zk8EspT9PQyARZLY=; b=TntB9Do/uO34omUbIFPNXnsxmE1BeE2wytW31XPUHlQY1I8XPn5XG7uI82bO1BYgzp BCcJRLSlzNxvJuQ4jaz1X71NupYbl/CCnR5dblwlCMKDufiYT7+DojIOYFU0H2Uc4Nya 8q9L+GH21WmJg/2e3xIN7mFgAX5q4KhIaMmJbw7MoN6C7cTQjsWhc/Xfr5u8LiS4Een3 kjKwz28br+xp5+Ltt9OxVGOKWLUD62cCV26K/ujyPuHc5seJ4aNPF7pzve7WisvvH6RL AosyoHu+683fctiE+o72HrC0yKQDKCD22yFuSm3uLrzA4j+9Q9B0k+DhxB2M0bkALfMX iujA== X-Gm-Message-State: AOAM533X4HLxgDzS0al7WePZNm0q3MmTtnFdlE3F5Qr+tSTGhkEAOPKR keiMoGPdY1SERYSdAhxwFTQ= X-Google-Smtp-Source: ABdhPJzSNndYJoD0AFRNoT9OVm9MNcFwGIdF5qxtt+E7GlTYxmxn9Yu2qg+zTTrOA5HQ4PGvLzk2UQ== X-Received: by 2002:a17:90a:ab0b:: with SMTP id m11mr7039002pjq.197.1602143702136; Thu, 08 Oct 2020 00:55:02 -0700 (PDT) Received: from localhost.localdomain ([203.205.141.61]) by smtp.gmail.com with ESMTPSA id k206sm6777106pfd.126.2020.10.08.00.54.59 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 08 Oct 2020 00:55:01 -0700 (PDT) From: yulei.kernel@gmail.com X-Google-Original-From: yuleixzhang@tencent.com To: akpm@linux-foundation.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com Cc: linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang , Chen Zhuo Subject: [PATCH 17/35] mm, dmemfs: support unmap_page_range() for dmemfs pmd Date: Thu, 8 Oct 2020 15:54:07 +0800 Message-Id: <267ac5a8b5f650d14667559495f6028a568abdd9.1602093760.git.yuleixzhang@tencent.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: References: In-Reply-To: References: Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: Yulei Zhang It is required by munmap() for dmemfs mapping. 
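For reference, an abridged sketch of the call path involved (intermediate frames omitted):

	/*
	 * munmap()
	 *   -> unmap_region()
	 *     -> unmap_page_range()
	 *       -> zap_pmd_range()  - must treat a dmem pmd as a stable huge entry
	 *         -> zap_huge_pmd() - clears the pmd; there is no struct page to release
	 */
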
Signed-off-by: Chen Zhuo Signed-off-by: Yulei Zhang --- mm/huge_memory.c | 2 ++ mm/memory.c | 8 +++++--- 2 files changed, 7 insertions(+), 3 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 531493a0bc82..73af337b454e 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1636,6 +1636,8 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, spin_unlock(ptl); if (is_huge_zero_pmd(orig_pmd)) tlb_remove_page_size(tlb, pmd_page(orig_pmd), HPAGE_PMD_SIZE); + } else if (pmd_special(orig_pmd)) { + spin_unlock(ptl); } else if (is_huge_zero_pmd(orig_pmd)) { zap_deposited_table(tlb->mm, pmd); spin_unlock(ptl); diff --git a/mm/memory.c b/mm/memory.c index 469af373ae76..2d2c0f8a966b 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1178,10 +1178,12 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb, pmd = pmd_offset(pud, addr); do { next = pmd_addr_end(addr, end); - if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) { - if (next - addr != HPAGE_PMD_SIZE) + if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || + pmd_devmap(*pmd) || pmd_special(*pmd)) { + if (next - addr != HPAGE_PMD_SIZE) { + VM_BUG_ON(pmd_special(*pmd)); __split_huge_pmd(vma, pmd, addr, false, NULL); - else if (zap_huge_pmd(tlb, vma, pmd, addr)) + } else if (zap_huge_pmd(tlb, vma, pmd, addr)) goto next; /* fall through */ } From patchwork Thu Oct 8 07:54:08 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11822355 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id BBF9D109B for ; Thu, 8 Oct 2020 07:55:14 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 93FC621924 for ; Thu, 8 Oct 2020 07:55:14 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="YMmnkQY3" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728944AbgJHHzM (ORCPT ); Thu, 8 Oct 2020 03:55:12 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52150 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728945AbgJHHzM (ORCPT ); Thu, 8 Oct 2020 03:55:12 -0400 Received: from mail-pl1-x641.google.com (mail-pl1-x641.google.com [IPv6:2607:f8b0:4864:20::641]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A32CAC0613D7; Thu, 8 Oct 2020 00:55:06 -0700 (PDT) Received: by mail-pl1-x641.google.com with SMTP id t18so2377986plo.1; Thu, 08 Oct 2020 00:55:06 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :in-reply-to:references; bh=c062z9g3gnR/I/VUW1enDVKC1CGF86sFMeBq5UIH9yE=; b=YMmnkQY3sdDOJ3lcgij0sPB7c7ws9B6hcVk1g8V14rp2oxxDebPKq+JUdKAtN4cxzs ebywEZ0+hJZUFFVGTNaWRqx5PueuG6qdWRvf+CDAP0N1zQ5wCM5HyFDkbjFskuP5YA+r QGSHPQj2HIxoU578u/gHISDs1ANCxD42EDBnZe44TL3/1MnICrEC1x0rU2Us0scHsDf3 jCPWqUl0bbvjvaim6QqJSf2svvXMY5Rk0pWiakSCe3BU3akz8vFMtljZHwUl9Xq4qkN7 jZikO0yA8kldHbQHXVbc7yb5OlzxOwIvcOYDOBFJpaDjNHlMmAEBHrKBD6zY1zQ5MlGn ZSig== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:in-reply-to:references; bh=c062z9g3gnR/I/VUW1enDVKC1CGF86sFMeBq5UIH9yE=; 
b=A5xjseC8YoHBBuJG62CjlwSD/bZR/8KWkljFGHFICCvPxEOA0iokWW9eZ5DLZ6oZO/ QkuZMUPUM6utNBfbDU95CGgghjWOYh/RMjq/IN7/KMB80pEzBNsMKQ2wD9Yr2/P9SYAk hb7KO+zc/kpgYiUAywlw3YXZmriv7pMmwSBhv8YK1vsA46+MXsQR+gpDaIHMshE36u0S nIJKIV6rxpxUWPbGlovWAIR4ZRPNHSiBjfP58n6kAT5zFvlensliJ7BPRs9jsIX/Nmmp dZb1wNM8vjs/EQmMLqZVPfNP0901CfUfVGvxsnFMadO6oxAI8/tMRxbphJLLz0AdDs2L E0pg== X-Gm-Message-State: AOAM530F6r5OL3+FesqlY1Y/NQFnUyox499DT8wKbLlLBIcmE1C7AMFb CjCJujF4Kb6SsZ4puWuUJieP4CiqDquopQ== X-Google-Smtp-Source: ABdhPJyK3jBEUk5kmaNd3EMLrw1C6cs+jFaf0qoSfvtLS36XRmTpxY9vYJHnrbHwlszIGh7v3DmGQw== X-Received: by 2002:a17:902:a9cc:b029:d3:77f7:3ca9 with SMTP id b12-20020a170902a9ccb02900d377f73ca9mr6536321plr.75.1602143706251; Thu, 08 Oct 2020 00:55:06 -0700 (PDT) Received: from localhost.localdomain ([203.205.141.61]) by smtp.gmail.com with ESMTPSA id k206sm6777106pfd.126.2020.10.08.00.55.03 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 08 Oct 2020 00:55:05 -0700 (PDT) From: yulei.kernel@gmail.com X-Google-Original-From: yuleixzhang@tencent.com To: akpm@linux-foundation.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com Cc: linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang , Chen Zhuo Subject: [PATCH 18/35] mm: follow_pmd_mask() for dmem huge pmd Date: Thu, 8 Oct 2020 15:54:08 +0800 Message-Id: <25a50b534bb73164dcad1be1f7b9c48756445c3a.1602093760.git.yuleixzhang@tencent.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: References: In-Reply-To: References: Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: Yulei Zhang While follow_pmd_mask(), dmem huge pmd should be recognized and return error pointer of '-EEXIST' to indicate that proper page table entry exists in pmd special but no corresponding struct page, because dmem page means non struct page backend. We update pmd if foll_flags takes FOLL_TOUCH. Signed-off-by: Chen Zhuo Signed-off-by: Yulei Zhang --- mm/gup.c | 42 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 42 insertions(+) diff --git a/mm/gup.c b/mm/gup.c index e5739a1974d5..726ffc5b0ea9 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -380,6 +380,42 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address, return -EEXIST; } +static struct page * +follow_special_pmd(struct vm_area_struct *vma, unsigned long address, + pmd_t *pmd, unsigned int flags) +{ + spinlock_t *ptl; + + if (flags & FOLL_DUMP) + /* Avoid special (like zero) pages in core dumps */ + return ERR_PTR(-EFAULT); + + /* No page to get reference */ + if (flags & FOLL_GET) + return ERR_PTR(-EFAULT); + + if (flags & FOLL_TOUCH) { + pmd_t _pmd; + + ptl = pmd_lock(vma->vm_mm, pmd); + if (!pmd_special(*pmd)) { + spin_unlock(ptl); + return NULL; + } + _pmd = pmd_mkyoung(*pmd); + if (flags & FOLL_WRITE) + _pmd = pmd_mkdirty(_pmd); + if (pmdp_set_access_flags(vma, address & HPAGE_PMD_MASK, + pmd, _pmd, + flags & FOLL_WRITE)) + update_mmu_cache_pmd(vma, address, pmd); + spin_unlock(ptl); + } + + /* Proper page table entry exists, but no corresponding struct page */ + return ERR_PTR(-EEXIST); +} + /* * FOLL_FORCE can write to even unwritable pte's, but only * after we've gone through a COW cycle and they are dirty. 
@@ -564,6 +600,12 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma, return page; return no_page_table(vma, flags); } + if (pmd_special(*pmd)) { + page = follow_special_pmd(vma, address, pmd, flags); + if (page) + return page; + return no_page_table(vma, flags); + } if (is_hugepd(__hugepd(pmd_val(pmdval)))) { page = follow_huge_pd(vma, address, __hugepd(pmd_val(pmdval)), flags, From patchwork Thu Oct 8 07:54:09 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11822363 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id BA7AC109B for ; Thu, 8 Oct 2020 07:55:25 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 8F5FE21924 for ; Thu, 8 Oct 2020 07:55:25 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="T1ETnzwe" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728959AbgJHHzO (ORCPT ); Thu, 8 Oct 2020 03:55:14 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52158 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728941AbgJHHzM (ORCPT ); Thu, 8 Oct 2020 03:55:12 -0400 Received: from mail-pf1-x441.google.com (mail-pf1-x441.google.com [IPv6:2607:f8b0:4864:20::441]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id E19A1C0613DA; Thu, 8 Oct 2020 00:55:10 -0700 (PDT) Received: by mail-pf1-x441.google.com with SMTP id a200so3307863pfa.10; Thu, 08 Oct 2020 00:55:10 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :in-reply-to:references; bh=DUrjITBbWyKSwaW9zQIzPOaWe1Yh8vSIfndNH38vsy8=; b=T1ETnzwey8YPOxGtwjIvQMYqUFJ6Yi/NeAmqkHW6jB3lhrgcCG4Ue66deFuERkIuY7 uLRPF/PH/dxB+CW2GtBntMfdirk1/9pVZ3c5mbSRFS5rAS1OibLGFlxp6MzSGKUSSIxi HSZQlklFFfs6WYtYwx9ymKKaUEoiHE8qREXnZjP/raMIRdJqQCHQsXLsx2A9lTtJdoAt ASXT0kVDxP4bOL9EDNMcVv+7o8kwfgxJnwIfQOVEr7FXVkiML8VRGXCV5KAXNIbp7loI vJMykbF6OQ2wixFtMfiqnn5RWiSWcZyShzjh5RAFJ+fbv58G1WAMgnwmTpTX0jrV1i3l GYRA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:in-reply-to:references; bh=DUrjITBbWyKSwaW9zQIzPOaWe1Yh8vSIfndNH38vsy8=; b=XraYdynveT4ha/z9+D4/5imqZJd2t/hxGtWd1pDKEGn5o6uRq4RjKJ94bbw0TkuYbq ZiEpz3jEDxn9K+Ettvzeyea+7i49AEOHXGt6Zj+/4e6sQK7PlqkiGzFh0f6c2Or7Osgk nCD6AztTTDpd59/zo48Sp+1Fvpq8ygFRjPOarlIRVf2XoMEqPQKsxmriF8igsflVjiGC oziQ7tPf25ycf6mlMt3dh3M8rC4ZgdRX1KNHIKLNa79auAUrtMIB1mxScmLzZ+zDB+8X OLuiXiXh9khucIjuZVCr2Ylg6uzJQnU1Lv/apPwYTfq1Ld0lmKsrRWBU09c7qUJnhb5e AeLA== X-Gm-Message-State: AOAM532ZILlWlhWVBhlwXaZAWxDE0J+qz3IneyGCZ5VH6kdCzr8UCO5e xeSxjWlizbs0H2xD5ifLNfmE+HMmFheI3A== X-Google-Smtp-Source: ABdhPJxn+zqF9duV6xKr3ksLqmnqartE3btzay1Xcm4zi3toaWYS9MK2OGf/BZtI53as57lPEh8HUQ== X-Received: by 2002:a17:90a:9f8e:: with SMTP id o14mr6872630pjp.103.1602143710505; Thu, 08 Oct 2020 00:55:10 -0700 (PDT) Received: from localhost.localdomain ([203.205.141.61]) by smtp.gmail.com with ESMTPSA id k206sm6777106pfd.126.2020.10.08.00.55.07 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 08 Oct 2020 00:55:10 -0700 (PDT) From: yulei.kernel@gmail.com X-Google-Original-From: yuleixzhang@tencent.com 
To: akpm@linux-foundation.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com Cc: linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang , Chen Zhuo Subject: [PATCH 19/35] mm: gup_huge_pmd() for dmem huge pmd Date: Thu, 8 Oct 2020 15:54:09 +0800 Message-Id: <184340f563959728d5e4e3d23463f54b797040b6.1602093760.git.yuleixzhang@tencent.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: References: In-Reply-To: References: Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: Yulei Zhang Add pmd_special() check in gup_huge_pmd() to support dmem huge pmd. GUP will return zero if enconter dmem page, and we could handle it outside GUP routine. Signed-off-by: Chen Zhuo Signed-off-by: Yulei Zhang --- mm/gup.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/mm/gup.c b/mm/gup.c index 726ffc5b0ea9..a8edbb6a2b2f 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -2440,6 +2440,10 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr, if (!pmd_access_permitted(orig, flags & FOLL_WRITE)) return 0; + /* Bypass dmem huge pmd. It will be handled in outside routine. */ + if (pmd_special(orig)) + return 0; + if (pmd_devmap(orig)) { if (unlikely(flags & FOLL_LONGTERM)) return 0; @@ -2542,7 +2546,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end, return 0; if (unlikely(pmd_trans_huge(pmd) || pmd_huge(pmd) || - pmd_devmap(pmd))) { + pmd_devmap(pmd) || pmd_special(pmd))) { /* * NUMA hinting faults need to be handled in the GUP * slowpath for accounting purposes and so that they From patchwork Thu Oct 8 07:54:10 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11822359 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 2FEF5109B for ; Thu, 8 Oct 2020 07:55:17 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 04AA32193E for ; Thu, 8 Oct 2020 07:55:16 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="lP17x+FN" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728977AbgJHHzP (ORCPT ); Thu, 8 Oct 2020 03:55:15 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52170 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728966AbgJHHzP (ORCPT ); Thu, 8 Oct 2020 03:55:15 -0400 Received: from mail-pg1-x544.google.com (mail-pg1-x544.google.com [IPv6:2607:f8b0:4864:20::544]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id EECFCC061755; Thu, 8 Oct 2020 00:55:14 -0700 (PDT) Received: by mail-pg1-x544.google.com with SMTP id i2so3583857pgh.7; Thu, 08 Oct 2020 00:55:14 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :in-reply-to:references; bh=RHYtPFeMZWLc18KP0oQeXS/4Vq6HE5iYv+578W57cug=; b=lP17x+FNFFFLix4HphKfiyj2fY+SOeaElAzyliE0QDst0Kgy0VTVEi5eU1ZRuOFA+u yOmLDKHUfkxwdJJjroxDWyiYbTUf8BbHu1OPFam/fGmSgzw0yhmI+gRg0SPBPrPyyAUn OcCbLhIHO72KoDqmPcloUyiUWPsaguNmqwHAmJtvhUyoDyI2pFgZ8FDRDp8qiLMn9XXn 1GHx99yph5JXE87p/AXyilWvNUHJZ3Ze3esClWvBHYlWBzrDXJQMCOOY9f2xYAC6nyB5 
lCSPkcQKhpY5Z3IbxanKzwHgmkVfmeKJ0XMLHQa8iHsSfOy8MShzyFupK9sh6NlNjid4 JPdQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:in-reply-to:references; bh=RHYtPFeMZWLc18KP0oQeXS/4Vq6HE5iYv+578W57cug=; b=HZCHqQYFQHFGh33+cdAPU4a/41O2nQp6kzd/htQwQRMy8+fQPhyFkGGfMQ6y2LZ7Kr yuKjZzDiLWexhMNthM931TUswXM+tjYidcH6Q6DYEb3Dk7ALEUHLqMpYc3KFj7vx62Af 4ClqsVJ2Bsxj1ZyMv3LLV8t9CgTdli/RGnFYTxDhcCAG5j8a+3du0QEONCdowdOziArs Uuq+UVQYrLL6bIvA3g/xvq9AOo/8WBawdacoksCmn3FoZHNXVxXWS/rUOfpNDrWxqWZn vCKNJwxDcjoSPw21iB+RbGDA0fzwcjnDUNDTjDzW5zSY0w6FbrZ2yb+ZVI/tzGkU9fxf MclA== X-Gm-Message-State: AOAM533VOpkBmwfdBXDXFTgWl4WCj7EwwpmXYnYUmJuD1VYHVYObuhQi Y4B69qgemB1mG5DFwgXGlDM= X-Google-Smtp-Source: ABdhPJy8bCqYDqGzlAUHaJ2Dg6h1nj0iX25rGQODuf6Kj0hKjwHAw5+HG0Wozp7TjnTg0+x3fk+34w== X-Received: by 2002:a17:90a:a09:: with SMTP id o9mr6435104pjo.134.1602143714612; Thu, 08 Oct 2020 00:55:14 -0700 (PDT) Received: from localhost.localdomain ([203.205.141.61]) by smtp.gmail.com with ESMTPSA id k206sm6777106pfd.126.2020.10.08.00.55.11 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 08 Oct 2020 00:55:14 -0700 (PDT) From: yulei.kernel@gmail.com X-Google-Original-From: yuleixzhang@tencent.com To: akpm@linux-foundation.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com Cc: linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang , Chen Zhuo Subject: [PATCH 20/35] mm: support dmem huge pmd for vmf_insert_pfn_pmd() Date: Thu, 8 Oct 2020 15:54:10 +0800 Message-Id: <7325d4c99cd3bbcd74fac182d06ca17f78c454a5.1602093760.git.yuleixzhang@tencent.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: References: In-Reply-To: References: Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: Yulei Zhang Since vmf_insert_pfn_pmd will BUG_ON non-pmd-devmap, we make pfn dmem pass the check. Dmem huge pmd will be marked with _PAGE_SPECIAL and _PAGE_DMEM, so that follow_pfn() could recognize it. Signed-off-by: Chen Zhuo Signed-off-by: Yulei Zhang --- mm/huge_memory.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 73af337b454e..a24601c93713 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -781,6 +781,8 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr, entry = pmd_mkhuge(pfn_t_pmd(pfn, prot)); if (pfn_t_devmap(pfn)) entry = pmd_mkdevmap(entry); + else if (pfn_t_dmem(pfn)) + entry = pmd_mkdmem(entry); if (write) { entry = pmd_mkyoung(pmd_mkdirty(entry)); entry = maybe_pmd_mkwrite(entry, vma); @@ -827,7 +829,7 @@ vm_fault_t vmf_insert_pfn_pmd_prot(struct vm_fault *vmf, pfn_t pfn, * can't support a 'special' bit. 
*/ BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) && - !pfn_t_devmap(pfn)); + !pfn_t_devmap(pfn) && !pfn_t_dmem(pfn)); BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) == (VM_PFNMAP|VM_MIXEDMAP)); BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags)); From patchwork Thu Oct 8 07:54:11 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11822437 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id D68F8109B for ; Thu, 8 Oct 2020 07:57:15 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id AD8772193E for ; Thu, 8 Oct 2020 07:57:15 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="feR3Hp4M" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729013AbgJHH5O (ORCPT ); Thu, 8 Oct 2020 03:57:14 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52184 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729000AbgJHHzT (ORCPT ); Thu, 8 Oct 2020 03:55:19 -0400 Received: from mail-pf1-x444.google.com (mail-pf1-x444.google.com [IPv6:2607:f8b0:4864:20::444]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 55784C061755; Thu, 8 Oct 2020 00:55:19 -0700 (PDT) Received: by mail-pf1-x444.google.com with SMTP id k8so3330846pfk.2; Thu, 08 Oct 2020 00:55:19 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :in-reply-to:references; bh=GAQmo3emgE7wGALCr56xbTRFWPok3r8JHn097oLTrgo=; b=feR3Hp4Mm0c11NezfLW8RmQ4nPtConWWWECi6vPDapgIc/SYfqzk1dmw2SvgMuHvkv spxvknKBvJBMPz0sV+Zd6yiR/GVE76Qc0zVt4ora65VHgmvWOEOoLtu+fWeoUqbkZFQF ZE7oR4PsRgqLQ5y/ReaMWKS2S4khTxfXyEPHBiKhA6MqGPxwmJ+At0ncMzUWgikHZ59U Jccx0kkt1U7OEylV2ok/m/AZt9s1UtU2NPlVDVs4FsWZtlC4V6N4cz+J9rE5u379tHKf yA1KVQR2wDwRXA7XpsJEyKe3piteYnWd9gG0vtRbL48dqdsrBJ65xUI6IYRFPz/EenIj VAjA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:in-reply-to:references; bh=GAQmo3emgE7wGALCr56xbTRFWPok3r8JHn097oLTrgo=; b=K3JeW0Y/ywY/J7OOnkTBIq7NdforhYMjmTJSKldL47ksX7HRYG46cViOmCLGxIYfC3 Y/OJB9EE5BZ0Lex71qeLEELCHPlADf/YaQ/fQM/I3l9gMBDs0mHWI33VFXWGQaPjbMSh afo9KUxhWMJmKIqaFs0qKNq68MpYNOBeBhqTq1b1GfkWBi2Zc+knOngB8vB9ddK9wp2/ dFneJLvTD9Z3E7tLotmxc9oMEWvfhFhG0RlKJq4ll/QeUsNI5GPSeEGEi+rsa6VA5Fpx //6qGr+HZr6EZ4SCPYLl49tc0fDhFu95VZ4rwCt5DfSHZc3BhzLcSNGIFXgQjCQ0WNVM QiUg== X-Gm-Message-State: AOAM532e6gpc6YmTQ2qeaVYPcHtd1Uu+TsmmACJMF3ZKcL+kAjOs80gE NdF0FURDrBnzYJSYFQBJPp0= X-Google-Smtp-Source: ABdhPJzmgKs9G1dEB0PWPDBmnTSeOR/fYfsVYYngL49TYYKJ8bmnoC2kfQISr+E1jQmIAcBcwTnBXg== X-Received: by 2002:a17:90a:7d16:: with SMTP id g22mr6886600pjl.135.1602143718936; Thu, 08 Oct 2020 00:55:18 -0700 (PDT) Received: from localhost.localdomain ([203.205.141.61]) by smtp.gmail.com with ESMTPSA id k206sm6777106pfd.126.2020.10.08.00.55.15 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 08 Oct 2020 00:55:18 -0700 (PDT) From: yulei.kernel@gmail.com X-Google-Original-From: yuleixzhang@tencent.com To: akpm@linux-foundation.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com Cc: linux-fsdevel@vger.kernel.org, 
kvm@vger.kernel.org, linux-kernel@vger.kernel.org, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang , Chen Zhuo Subject: [PATCH 21/35] mm: support dmem huge pmd for follow_pfn() Date: Thu, 8 Oct 2020 15:54:11 +0800 Message-Id: <5c508795a2f262e80cc3855853eba4042c863a3f.1602093760.git.yuleixzhang@tencent.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: References: In-Reply-To: References: Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: Yulei Zhang follow_pfn() will get pfn of pmd if huge pmd is encountered. Signed-off-by: Chen Zhuo Signed-off-by: Yulei Zhang --- mm/memory.c | 14 +++++++++++--- 1 file changed, 11 insertions(+), 3 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 2d2c0f8a966b..ca42a6e56e9b 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4644,15 +4644,23 @@ int follow_pfn(struct vm_area_struct *vma, unsigned long address, int ret = -EINVAL; spinlock_t *ptl; pte_t *ptep; + pmd_t *pmdp = NULL; if (!(vma->vm_flags & (VM_IO | VM_PFNMAP))) return ret; - ret = follow_pte(vma->vm_mm, address, &ptep, &ptl); + ret = follow_pte_pmd(vma->vm_mm, address, NULL, &ptep, &pmdp, &ptl); if (ret) return ret; - *pfn = pte_pfn(*ptep); - pte_unmap_unlock(ptep, ptl); + + if (pmdp) { + *pfn = pmd_pfn(*pmdp) + ((address & ~PMD_MASK) >> PAGE_SHIFT); + spin_unlock(ptl); + } else { + *pfn = pte_pfn(*ptep); + pte_unmap_unlock(ptep, ptl); + } + return 0; } EXPORT_SYMBOL(follow_pfn); From patchwork Thu Oct 8 07:54:12 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11822433 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 32B5113B2 for ; Thu, 8 Oct 2020 07:57:13 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 0ADC821924 for ; Thu, 8 Oct 2020 07:57:10 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="SQ2f9dH1" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728814AbgJHH5H (ORCPT ); Thu, 8 Oct 2020 03:57:07 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52206 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729017AbgJHHzY (ORCPT ); Thu, 8 Oct 2020 03:55:24 -0400 Received: from mail-pg1-x542.google.com (mail-pg1-x542.google.com [IPv6:2607:f8b0:4864:20::542]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 63774C0613D5; Thu, 8 Oct 2020 00:55:23 -0700 (PDT) Received: by mail-pg1-x542.google.com with SMTP id h6so3596156pgk.4; Thu, 08 Oct 2020 00:55:23 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :in-reply-to:references; bh=Q63UGNU50PxJwBeAw35OlXFM2zT0ydG7N+0qP8I/gHo=; b=SQ2f9dH1UsoGi1f/XZ7ldaen6tWpM4BB3XGMmJ7K/NG22QSMmTLZu6+XrO8owZNNCA Sag8KMDiI/xIiMonNj2iFUHigJcVao+uCWBhgm2p9aLuin9Lp9hRdEJEtTpqu9oNuo2x nLMeX5Z+LjndXkm7XET64/pO6fW7Hmid8g0nOkftuuslqiWXTlOPP/+D5mnT56ARk4Sl FBI3YJulFj83gaDSMHT4/GS3HBbcLKb8v8f+CdQ4co7oFUQEyPk1L3foWAcn+VyVhAl5 cIJrejQMadXEQC6kWkYMlQNj9k4aNiM5cVwDpsCbhpNTSQ09QMLbQGkpsg8ku5AjhKsZ KIYQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to 
:references:in-reply-to:references; bh=Q63UGNU50PxJwBeAw35OlXFM2zT0ydG7N+0qP8I/gHo=; b=fetPs220nvH1vHD6nZpkyVd74iLFO92IHsNnwc1hwOG97j4qFWMp7X9Le08aFDbGAs 0jBKvKIpgsA/MUpa0oss3WQA0E1E5gi7E29VjdwkIQIXaAroU+M9hg/3k1+zZIfEdcwV 83PQdhXqJXkjGys3GRn72E3PFvZzDp9AfUEm0h8vdXC8BF6aid6kvCyCUrULhk1eMmW5 ogWa+PehnDawNfYWufdPqAa9O4chOfkA5JVLKmnXVNDiyXa4vVEWiVDS07BuQVJl8WN8 /iJTDa2OQc6vTYAVWge8uwnNrHAOdo+yLaxPliVKGfcbYAXd/dHyBg3hx8jW1tAGBjeA gveg== X-Gm-Message-State: AOAM533x/TIXAupN/TW6Z6scFKXxYKnNMNOwsFHUgNuVUQuVCRK9lHO+ SXE5gkN7IxyVPND5HYktRBA= X-Google-Smtp-Source: ABdhPJxpDi+MtXox+1vt3agOi/zxClZUn41exKjvJWiEd5KpI2sW06SKrQRX7PWXhlZeBzN93ix7PA== X-Received: by 2002:a17:90a:a88:: with SMTP id 8mr6844949pjw.105.1602143723037; Thu, 08 Oct 2020 00:55:23 -0700 (PDT) Received: from localhost.localdomain ([203.205.141.61]) by smtp.gmail.com with ESMTPSA id k206sm6777106pfd.126.2020.10.08.00.55.20 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 08 Oct 2020 00:55:22 -0700 (PDT) From: yulei.kernel@gmail.com X-Google-Original-From: yuleixzhang@tencent.com To: akpm@linux-foundation.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com Cc: linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang , Chen Zhuo Subject: [PATCH 22/35] kvm, x86: Distinguish dmemfs page from mmio page Date: Thu, 8 Oct 2020 15:54:12 +0800 Message-Id: X-Mailer: git-send-email 2.17.1 In-Reply-To: References: In-Reply-To: References: Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: Yulei Zhang Dmem page is pfn invalid but not mmio. Support cacheable dmem page for kvm. Signed-off-by: Chen Zhuo Signed-off-by: Yulei Zhang --- arch/x86/kvm/mmu/mmu.c | 5 +++-- include/linux/dmem.h | 7 +++++++ mm/dmem.c | 7 +++++++ 3 files changed, 17 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c index 71aa3da2a0b7..0115c1767063 100644 --- a/arch/x86/kvm/mmu/mmu.c +++ b/arch/x86/kvm/mmu/mmu.c @@ -41,6 +41,7 @@ #include #include #include +#include #include #include @@ -2962,9 +2963,9 @@ static bool kvm_is_mmio_pfn(kvm_pfn_t pfn) */ (!pat_enabled() || pat_pfn_immune_to_uc_mtrr(pfn)); - return !e820__mapped_raw_any(pfn_to_hpa(pfn), + return (!e820__mapped_raw_any(pfn_to_hpa(pfn), pfn_to_hpa(pfn + 1) - 1, - E820_TYPE_RAM); + E820_TYPE_RAM)) || (!is_dmem_pfn(pfn)); } /* Bits which may be returned by set_spte() */ diff --git a/include/linux/dmem.h b/include/linux/dmem.h index 8682d63ed43a..59d3ef14fe42 100644 --- a/include/linux/dmem.h +++ b/include/linux/dmem.h @@ -19,11 +19,18 @@ dmem_alloc_pages_vma(struct vm_area_struct *vma, unsigned long addr, unsigned int try_max, unsigned int *result_nr); void dmem_free_pages(phys_addr_t addr, unsigned int dpages_nr); +bool is_dmem_pfn(unsigned long pfn); #define dmem_free_page(addr) dmem_free_pages(addr, 1) #else static inline int dmem_reserve_init(void) { return 0; } + +static inline bool is_dmem_pfn(unsigned long pfn) +{ + return 0; +} + #endif #endif /* _LINUX_DMEM_H */ diff --git a/mm/dmem.c b/mm/dmem.c index 2e61dbddbc62..eb6df7059cf0 100644 --- a/mm/dmem.c +++ b/mm/dmem.c @@ -972,3 +972,10 @@ void dmem_free_pages(phys_addr_t addr, unsigned int dpages_nr) } EXPORT_SYMBOL(dmem_free_pages); +bool is_dmem_pfn(unsigned long pfn) +{ + struct dmem_node *dnode; + + return !!find_dmem_region(__pfn_to_phys(pfn), &dnode); +} +EXPORT_SYMBOL(is_dmem_pfn); From patchwork Thu Oct 8 07:54:13 2020 Content-Type: text/plain; 
charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11822385 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 276E513B2 for ; Thu, 8 Oct 2020 07:55:59 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 0373C21897 for ; Thu, 8 Oct 2020 07:55:58 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="fngZ4V7u" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729098AbgJHHzq (ORCPT ); Thu, 8 Oct 2020 03:55:46 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52256 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729040AbgJHHzm (ORCPT ); Thu, 8 Oct 2020 03:55:42 -0400 Received: from mail-pf1-x442.google.com (mail-pf1-x442.google.com [IPv6:2607:f8b0:4864:20::442]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id B5382C0613D7; Thu, 8 Oct 2020 00:55:27 -0700 (PDT) Received: by mail-pf1-x442.google.com with SMTP id g10so3317316pfc.8; Thu, 08 Oct 2020 00:55:27 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :in-reply-to:references; bh=rgl0RB6gKY0poqmey3m0fH+Fzjy9yKEsxNNtoKcaNdI=; b=fngZ4V7uI0g9A6oGR8TT184yksleNQiusVTwWIjxfwwoKMU1wm78uDgw4GhFCinpLT +5h7GvXwXYZ/tSqURQ2Tq28OTn9cZdYpbuc9i/+5QfFNfKDql8jLuOJvp3pZMqzzSpoq JkVRBPdFlpg9p5p4nO8u2r6ghJvAjYSFs0VzTnFqWT3Qklsxq3dHrx1QJOD/FqOKir/8 c390Pqs5yl9bpYxI3jAZA1NmbAB5FRlB6FYY0jt/dbirBkws2o2k9Jnx+xEXda7bDKi5 lxSdolQfWzRLtbR+EYOMZUQ3G0mV1QvhE3zsSPnWXDu70Oqzf2AfwS6qh6JW6y4MgiHh W2fw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:in-reply-to:references; bh=rgl0RB6gKY0poqmey3m0fH+Fzjy9yKEsxNNtoKcaNdI=; b=IZpNoWgh+u7LvNhjZjxuf9BNmBooDMYkoC5hxF4crnpKMHjQZdcUfKtw0vbuNnLYM1 HBCX9kuxR1kNk7Jd2/OEctRfbM1dBglZuzsHKH790wwaGNgFfEzJqjlV9/YX4Ig20r8V oK+R+YtKlkT5FEsrC/MHx+L6Gm0Fab1mZt3zKU74LtgAC1YWrGOQ7iaMTmGtu+xleRQL ZmpPA4eDDdhHz646ExoxhfXRcv96cg6+E5gZxbe2VMUGWvL99VrIBv3Z59nz0Odhvau7 jkhF+Grr3qi8M7bfrsokIjElr68tjVir6KwkUr6MEq/N+8X4JKNTqVZBtFmGJXK5PCO3 doqw== X-Gm-Message-State: AOAM533RlCSC6Rly6NeFLM7/tuWUJu2AR8JkUrzwhrmO5dSyIN3xzbDt pba/q7PNGOo4dAwbwc/xHR602ix3zQ+NXA== X-Google-Smtp-Source: ABdhPJw6ynUy5YDxI8vIljMLF5r1pNoOsWb1QWSg0/UqQKFCKO4W8P/RsEsn26/IjH6bTFmivX+QQA== X-Received: by 2002:a17:90b:f8b:: with SMTP id ft11mr6844289pjb.8.1602143727295; Thu, 08 Oct 2020 00:55:27 -0700 (PDT) Received: from localhost.localdomain ([203.205.141.61]) by smtp.gmail.com with ESMTPSA id k206sm6777106pfd.126.2020.10.08.00.55.24 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 08 Oct 2020 00:55:26 -0700 (PDT) From: yulei.kernel@gmail.com X-Google-Original-From: yuleixzhang@tencent.com To: akpm@linux-foundation.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com Cc: linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang , Chen Zhuo Subject: [PATCH 23/35] kvm, x86: introduce VM_DMEM Date: Thu, 8 Oct 2020 15:54:13 +0800 Message-Id: 
<3c8fc6f37abe66c13348c9af2eacee04d4dfaa72.1602093760.git.yuleixzhang@tencent.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: References: In-Reply-To: References: Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: Yulei Zhang Currently dmemfs do not support memory readonly, so change_protection() will be disabled for dmemfs vma. Since vma->vm_flags could be changed to new flag in mprotect_fixup(), so we introduce a new vma flag VM_DMEM and check this flag in mprotect_fixup() to avoid changing vma->vm_flags. We also check it in vma_to_resize() to disable mremap() for dmemfs vma. Signed-off-by: Chen Zhuo Signed-off-by: Yulei Zhang --- fs/dmemfs/inode.c | 2 +- include/linux/mm.h | 7 +++++++ mm/mprotect.c | 5 ++++- mm/mremap.c | 3 +++ 4 files changed, 15 insertions(+), 2 deletions(-) diff --git a/fs/dmemfs/inode.c b/fs/dmemfs/inode.c index e37498c00497..b3e394f33b42 100644 --- a/fs/dmemfs/inode.c +++ b/fs/dmemfs/inode.c @@ -510,7 +510,7 @@ int dmemfs_file_mmap(struct file *file, struct vm_area_struct *vma) if (!(vma->vm_flags & VM_SHARED)) return -EINVAL; - vma->vm_flags |= VM_PFNMAP; + vma->vm_flags |= VM_PFNMAP | VM_DMEM | VM_IO; file_accessed(file); vma->vm_ops = &dmemfs_vm_ops; diff --git a/include/linux/mm.h b/include/linux/mm.h index ca6e6a81576b..7b1e574d2387 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -309,6 +309,8 @@ extern unsigned int kobjsize(const void *objp); #define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4) #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */ +#define VM_DMEM BIT(38) /* Dmem page VM */ + #ifdef CONFIG_ARCH_HAS_PKEYS # define VM_PKEY_SHIFT VM_HIGH_ARCH_BIT_0 # define VM_PKEY_BIT0 VM_HIGH_ARCH_0 /* A protection key is a 4-bit value */ @@ -656,6 +658,11 @@ static inline bool vma_is_accessible(struct vm_area_struct *vma) return vma->vm_flags & VM_ACCESS_FLAGS; } +static inline bool vma_is_dmem(struct vm_area_struct *vma) +{ + return !!(vma->vm_flags & VM_DMEM); +} + #ifdef CONFIG_SHMEM /* * The vma_is_shmem is not inline because it is used only by slow diff --git a/mm/mprotect.c b/mm/mprotect.c index ce8b8a5eacbb..36f885cbbb30 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -236,7 +236,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, * for all the checks. */ if (!is_swap_pmd(*pmd) && !pmd_devmap(*pmd) && - pmd_none_or_clear_bad_unless_trans_huge(pmd)) + pmd_none_or_clear_bad_unless_trans_huge(pmd) && !pmd_special(*pmd)) goto next; /* invoke the mmu notifier if the pmd is populated */ @@ -412,6 +412,9 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev, return 0; } + if (vma_is_dmem(vma)) + return -EINVAL; + /* * Do PROT_NONE PFN permission checks here when we can still * bail out without undoing a lot of state. This is a rather diff --git a/mm/mremap.c b/mm/mremap.c index 138abbae4f75..598e68174e24 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -482,6 +482,9 @@ static struct vm_area_struct *vma_to_resize(unsigned long addr, if (!vma || vma->vm_start > addr) return ERR_PTR(-EFAULT); + if (vma_is_dmem(vma)) + return ERR_PTR(-EINVAL); + /* * !old_len is a special case where an attempt is made to 'duplicate' * a mapping. 
This makes no sense for private mappings as it will From patchwork Thu Oct 8 07:54:14 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11822377 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 8327B109B for ; Thu, 8 Oct 2020 07:55:46 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 5015421897 for ; Thu, 8 Oct 2020 07:55:46 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="TcvgfWTv" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729069AbgJHHzp (ORCPT ); Thu, 8 Oct 2020 03:55:45 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52258 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729041AbgJHHzm (ORCPT ); Thu, 8 Oct 2020 03:55:42 -0400 Received: from mail-pg1-x542.google.com (mail-pg1-x542.google.com [IPv6:2607:f8b0:4864:20::542]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id CC462C0613D8; Thu, 8 Oct 2020 00:55:31 -0700 (PDT) Received: by mail-pg1-x542.google.com with SMTP id x16so3601121pgj.3; Thu, 08 Oct 2020 00:55:31 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :in-reply-to:references; bh=+D+2jt/lSuuv5ngz/ELFFsRQ4iKFzjc5y47NExgYtsk=; b=TcvgfWTvlwyihLEatUZPA4DfwXuPBYdxYxsTqPa8ZdtWUbnm9rtcAkT/7Et6Sk/MNA s8s+DpwCf67JvoXW7URCCoLiVUozFwQ5NN5RsK5YMRltwBP4Zzw0G1KUEFsqw10ccowZ c5WfSyAxoFkHjDwvGVpGhUtb2p0PBwBpj7XCZmuyh3lE1YliF1EMrGYanfKiD2kNMBQx QVAnbEXbiWYIKyLnUtzR6nOeeJntLp504YbpTd3woPyYrLfena1jQ/LP9vVsy7Ei7X+Y 7uau9Qp+78+cp3nvJXMgVhpZqbC552v3oXqLfPjxUwD6Do7huS9eoNA7J1bMwpNwcnDN rNuA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:in-reply-to:references; bh=+D+2jt/lSuuv5ngz/ELFFsRQ4iKFzjc5y47NExgYtsk=; b=YpTxwOGKpUt+ZU7un/C2/zLNXwJ7tC4QVm7xSviqFap6SqdhbaoUvRo7qz7dBGSUjy brVddpYiFXsfkjNsz2olWciF/+I3vazg/utbzGgxJ6G6P3PqIcuBYIsxxKGGOT2LD6Nx 64ECJLJ003NElVO69iTtoww25CvmTzyVtG6oe4VWcC1JoxXlMfUGl2IdIZZh8HiHpEm3 kPqsYlSIVeEqY/9RpR53fIqMv0oB4uB4Y570RjhsFdXHB3jRORNfs3CdBt5qv1QV7Vwt SDQ4+pd4aJG7sQBTF3ukZOzzJcfQ3xFNzeV2kSpC5Y0XVgd0jrvVYR4LQNqZAYSWMLDI v8Uw== X-Gm-Message-State: AOAM530TMVkIJ6RZUEkpCMaAvgePwypUUpEuYtkvSdOK/o/Irlrrf2AG KMkwMSDymw3a7YM82GCW+qM= X-Google-Smtp-Source: ABdhPJwjl1+Gftssqt2c421sbJB6de2k1LCPW6sxmsPvotpeXPXE9JAkY7pY86kFzlUX5gGJakpTZQ== X-Received: by 2002:a63:d65:: with SMTP id 37mr6411206pgn.139.1602143731417; Thu, 08 Oct 2020 00:55:31 -0700 (PDT) Received: from localhost.localdomain ([203.205.141.61]) by smtp.gmail.com with ESMTPSA id k206sm6777106pfd.126.2020.10.08.00.55.28 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 08 Oct 2020 00:55:30 -0700 (PDT) From: yulei.kernel@gmail.com X-Google-Original-From: yuleixzhang@tencent.com To: akpm@linux-foundation.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com Cc: linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang , Chen Zhuo Subject: [PATCH 24/35] dmemfs: support hugepage for dmemfs Date: Thu, 8 Oct 
2020 15:54:14 +0800 Message-Id: <4d6038207c6472a0dd3084cbc77e70554fb9de91.1602093760.git.yuleixzhang@tencent.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: References: In-Reply-To: References: Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: Yulei Zhang It add hugepage support for dmemfs. We use PFN_DMEM to notify vmf_insert_pfn_pmd, and dmem huge pmd will be marked with _PAGE_SPECIAL and _PAGE_DMEM. So that GUP-fast can separate dmemfs page from other page type and handle it correctly. Signed-off-by: Chen Zhuo Signed-off-by: Yulei Zhang --- fs/dmemfs/inode.c | 113 +++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 111 insertions(+), 2 deletions(-) diff --git a/fs/dmemfs/inode.c b/fs/dmemfs/inode.c index b3e394f33b42..53a9bf214e0d 100644 --- a/fs/dmemfs/inode.c +++ b/fs/dmemfs/inode.c @@ -460,7 +460,7 @@ static int dmemfs_split(struct vm_area_struct *vma, unsigned long addr) return 0; } -static vm_fault_t dmemfs_fault(struct vm_fault *vmf) +static vm_fault_t __dmemfs_fault(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; struct inode *inode = file_inode(vma->vm_file); @@ -488,6 +488,63 @@ static vm_fault_t dmemfs_fault(struct vm_fault *vmf) return ret; } +static vm_fault_t __dmemfs_pmd_fault(struct vm_fault *vmf) +{ + struct vm_area_struct *vma = vmf->vma; + unsigned long pmd_addr = vmf->address & PMD_MASK; + unsigned long page_addr; + struct inode *inode = file_inode(vma->vm_file); + void *entry; + phys_addr_t phys; + pfn_t pfn; + int ret; + + if (dmem_page_size(inode) < PMD_SIZE) + return VM_FAULT_FALLBACK; + + WARN_ON(pmd_addr < vma->vm_start || + vma->vm_end < pmd_addr + PMD_SIZE); + + page_addr = vmf->address & ~(dmem_page_size(inode) - 1); + entry = radix_get_create_entry(vma, page_addr, inode, + linear_page_index(vma, page_addr)); + if (IS_ERR(entry)) + return (PTR_ERR(entry) == -ENOMEM) ? 
+ VM_FAULT_OOM : VM_FAULT_SIGBUS; + + phys = dmem_addr_to_pfn(inode, dmem_entry_to_addr(inode, entry), + linear_page_index(vma, pmd_addr), PMD_SHIFT); + phys <<= PAGE_SHIFT; + pfn = phys_to_pfn_t(phys, PFN_DMEM); + ret = vmf_insert_pfn_pmd(vmf, pfn, !!(vma->vm_flags & VM_WRITE)); + + radix_put_entry(); + return ret; +} + +static vm_fault_t dmemfs_huge_fault(struct vm_fault *vmf, enum page_entry_size pe_size) +{ + int ret; + + switch (pe_size) { + case PE_SIZE_PTE: + ret = __dmemfs_fault(vmf); + break; + case PE_SIZE_PMD: + ret = __dmemfs_pmd_fault(vmf); + break; + default: + ret = VM_FAULT_SIGBUS; + } + + return ret; +} + +static vm_fault_t dmemfs_fault(struct vm_fault *vmf) +{ + return dmemfs_huge_fault(vmf, PE_SIZE_PTE); +} + static unsigned long dmemfs_pagesize(struct vm_area_struct *vma) { return dmem_page_size(file_inode(vma->vm_file)); @@ -498,6 +555,7 @@ static const struct vm_operations_struct dmemfs_vm_ops = { .fault = dmemfs_fault, .pagesize = dmemfs_pagesize, .access = dmemfs_access_dmem, + .huge_fault = dmemfs_huge_fault, }; int dmemfs_file_mmap(struct file *file, struct vm_area_struct *vma) @@ -510,15 +568,66 @@ int dmemfs_file_mmap(struct file *file, struct vm_area_struct *vma) if (!(vma->vm_flags & VM_SHARED)) return -EINVAL; - vma->vm_flags |= VM_PFNMAP | VM_DMEM | VM_IO; + vma->vm_flags |= VM_PFNMAP | VM_DONTCOPY | VM_DMEM | VM_IO; + + if (dmem_page_size(inode) != PAGE_SIZE) + vma->vm_flags |= VM_HUGEPAGE; file_accessed(file); vma->vm_ops = &dmemfs_vm_ops; return 0; } +/* + * If the size of area returned by mm->get_unmapped_area() is one + * dmem pagesize larger than 'len', the returned addr by + * mm->get_unmapped_area() could be aligned to dmem pagesize to + * meet alignment demand. + */ +static unsigned long +dmemfs_get_unmapped_area(struct file *file, unsigned long addr, + unsigned long len, unsigned long pgoff, + unsigned long flags) +{ + unsigned long len_pad; + unsigned long off = pgoff << PAGE_SHIFT; + unsigned long align; + + align = dmem_page_size(file_inode(file)); + + /* For pud or pmd pagesize, could not support fault fallback. */ + if (len & (align - 1)) + return -EINVAL; + if (len > TASK_SIZE) + return -ENOMEM; + + if (flags & MAP_FIXED) { + if (addr & (align - 1)) + return -EINVAL; + return addr; + } + + /* + * Pad a extra align space for 'len', as we want to find a unmapped + * area which is larger enough to align with dmemfs pagesize, if + * pagesize of dmem is larger than 4K. + */ + len_pad = (align == PAGE_SIZE) ? len : len + align; + + /* 'len' or 'off' is too large for pad. */ + if (len_pad < len || (off + len_pad) < off) + return -EINVAL; + + addr = current->mm->get_unmapped_area(file, addr, len_pad, + pgoff, flags); + + /* Now 'addr' could be aligned to upper boundary. */ + return IS_ERR_VALUE(addr) ? 
addr : round_up(addr, align); +} + static const struct file_operations dmemfs_file_operations = { .mmap = dmemfs_file_mmap, + .get_unmapped_area = dmemfs_get_unmapped_area, }; static int dmemfs_parse_param(struct fs_context *fc, struct fs_parameter *param) From patchwork Thu Oct 8 07:54:15 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11822387 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id B25C513B2 for ; Thu, 8 Oct 2020 07:56:01 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 88F4F21897 for ; Thu, 8 Oct 2020 07:56:01 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="b8NBVz8L" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729078AbgJHHzp (ORCPT ); Thu, 8 Oct 2020 03:55:45 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52262 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729043AbgJHHzm (ORCPT ); Thu, 8 Oct 2020 03:55:42 -0400 Received: from mail-pg1-x543.google.com (mail-pg1-x543.google.com [IPv6:2607:f8b0:4864:20::543]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id EE3F3C0613D9; Thu, 8 Oct 2020 00:55:35 -0700 (PDT) Received: by mail-pg1-x543.google.com with SMTP id r21so230278pgj.5; Thu, 08 Oct 2020 00:55:35 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :in-reply-to:references; bh=O7KgnpfqtBbUkq2cDwxXf9kTKX6Grw9EaxzUExBq/K0=; b=b8NBVz8LhXuNHd+ITvEO/nDSIRbO/9WK+JDJ/Ktk2N795T1Tgth+QBzhTxqop9dY/s laiCXpKPS3kwHQuF7eNL68OOeQ0YHZ3kGJMKusmfdBuhJkjmk9AzebE/2Bp6pW8AS7Dd aeEx2fBGaD5hG56qmFpGOn0GbF7vBgNYod8nPwq1YPRPBpw8s7SZe0GWl6X1RFALUZ4j a7eETflyj4nJwnwKgSiQnfC/IQecHabDHGkT9OQvruDzUK72bs5rLWBgXVPrYI4HqK4m TetiBM5DPcwJJPC6H2GKJ47jHr0IvkLo9UA5sxztR0vk09CA4bUBGoHWcBteo0pmz0A8 IIUg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:in-reply-to:references; bh=O7KgnpfqtBbUkq2cDwxXf9kTKX6Grw9EaxzUExBq/K0=; b=FTzbSPXTy9TRy3EV4Ho1qLj20xJaJb72jecG2Hgl5CCc9EuI9hfZJQ1wEw+gfx0+zx GqXf4q77qyW3cyO6IE33/nT+hwPlRwxJOgZ5q1Rksa7bMgX6c6NkAnMksFzqslJY2f4O 6dtT85NtKSJaji3o0ZtgMeXOZKgr6D+DWMquKRLQvUaxFz6QHEPVnncVZDeMdsSm/9bQ q1O2akbmRYoSWPOB0FZDk9yBxE57/389ToK4KDfjn8XYLdo02rNG8ek2V84HPLY4/xtV 0zdE4SDYn4j4E8g3fLgfOCmq4Da05uHqivw1YExJ1NurdQNNXFRLCb7KBUUOMhztydGK G/3g== X-Gm-Message-State: AOAM532rc0fl3PlOHAuolaAq7dhXOBE3admF19aw9uKJ21Xu1x8YPmRt NX8ABmE9mEQuqd9f5jqA+7I= X-Google-Smtp-Source: ABdhPJyleLfKb682LDA69C1U3Po3BXYpQi6U+mncLSpwR3jjSMwK/hu67zP7yA4aOc9mj30ghga09A== X-Received: by 2002:a17:90a:7d16:: with SMTP id g22mr6887590pjl.135.1602143735614; Thu, 08 Oct 2020 00:55:35 -0700 (PDT) Received: from localhost.localdomain ([203.205.141.61]) by smtp.gmail.com with ESMTPSA id k206sm6777106pfd.126.2020.10.08.00.55.32 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 08 Oct 2020 00:55:35 -0700 (PDT) From: yulei.kernel@gmail.com X-Google-Original-From: yuleixzhang@tencent.com To: akpm@linux-foundation.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com Cc: linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, 
linux-kernel@vger.kernel.org, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang , Chen Zhuo Subject: [PATCH 25/35] mm, x86, dmem: fix estimation of reserved page for vaddr_get_pfn() Date: Thu, 8 Oct 2020 15:54:15 +0800 Message-Id: X-Mailer: git-send-email 2.17.1 In-Reply-To: References: In-Reply-To: References: Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: Yulei Zhang Fix estimation of reserved page for vaddr_get_pfn() and check 'ret' before checking writable permission Signed-off-by: Chen Zhuo Signed-off-by: Yulei Zhang --- drivers/vfio/vfio_iommu_type1.c | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c index 5fbf0c1f7433..257a8cab0a77 100644 --- a/drivers/vfio/vfio_iommu_type1.c +++ b/drivers/vfio/vfio_iommu_type1.c @@ -471,6 +471,10 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr, if (ret == -EAGAIN) goto retry; + if (!ret && (prot & IOMMU_WRITE) && + !(vma->vm_flags & VM_WRITE)) + ret = -EFAULT; + if (!ret && !is_invalid_reserved_pfn(*pfn)) ret = -EFAULT; } From patchwork Thu Oct 8 07:54:16 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11822381 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 83F55109B for ; Thu, 8 Oct 2020 07:55:50 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 5B51E2193E for ; Thu, 8 Oct 2020 07:55:50 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="HlUSHDav" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729112AbgJHHzs (ORCPT ); Thu, 8 Oct 2020 03:55:48 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52266 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728620AbgJHHzm (ORCPT ); Thu, 8 Oct 2020 03:55:42 -0400 Received: from mail-pl1-x643.google.com (mail-pl1-x643.google.com [IPv6:2607:f8b0:4864:20::643]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 29DAEC0613DC; Thu, 8 Oct 2020 00:55:40 -0700 (PDT) Received: by mail-pl1-x643.google.com with SMTP id c6so2363443plr.9; Thu, 08 Oct 2020 00:55:40 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :in-reply-to:references; bh=baP6qc0YYj+DmTYY1P4mAyPBjNnkQzpG/qxwTfH16NY=; b=HlUSHDav+EhjoX8LnlxBqvY2+riyA8I6hOwYOI9ZJa/Jg3ppNe1yOU9rIeK4RDmlYN SVOsNko6yENO0fWa1FL+DZrDXX/SSwYM1px2al6NBrI9MFDU3W5ieC193pUPoxS1uIjC l66JFsudPvUe3au7r25G8o/B+VK68M/JLPfSUMtEIY33LHZGmas/Ti+9HjdibhPfV4p2 n+mhbbVmU3u7bUVALqw5c13vM4iB3tYsDQd4gFG+WZVV4WzRY3OOIInNUSfwq53rUULB Dfa+9yhFNGbVm9jAMW4Zwhb8UZKQRWIoi13rBQcLeNh5pt+OKYyBAaNe8xEgZgd94yqj Gy/A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:in-reply-to:references; bh=baP6qc0YYj+DmTYY1P4mAyPBjNnkQzpG/qxwTfH16NY=; b=pKvNJlfbztP7CcAXjypCwptx3vHXvTfMHrfnYAmDOIeIBBtgbghe1NfnMan1NLeuuX 8Xrgp0S97qtJOdd04Rq1Rj1qHbJmXlRFJiYXZe/Xw7qMVuDcljw4FBIfr2yuEMTeS97h Z4zk76AeMr0RdVAwUJdoepZ25ONfRFtbSzbNYRtt/BSepSn263Mwvkjo5yUqrAOViedz 
llgOFUa3I74xFQot9vJFl1Htx6D1vumQ0I6mspJ6k1qtZ+JniAdf9u0QSGEDaoo9lS8F C33EIbayvEItblBgRyNOjTiQX1Z2Ipa44Nz84jhARicdcwNUcUK5fV94Zlz7WlpNuaI6 b17A== X-Gm-Message-State: AOAM533Ub0zZMqBLRXxQbBgBK2L9EVXX27ZRN19rPA2fVGBijnss5So6 zlse3HGWLioVeBGIKy4GaAc= X-Google-Smtp-Source: ABdhPJxwJ8nEIz12OEqiqat5/gJAm4Dlr4Kznc/XMa50UHEHUR3Q+1dDpcKPk4ntPTyCjt7vpb0t9A== X-Received: by 2002:a17:902:7d8d:b029:d3:95b9:68ed with SMTP id a13-20020a1709027d8db02900d395b968edmr6433338plm.28.1602143739775; Thu, 08 Oct 2020 00:55:39 -0700 (PDT) Received: from localhost.localdomain ([203.205.141.61]) by smtp.gmail.com with ESMTPSA id k206sm6777106pfd.126.2020.10.08.00.55.36 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 08 Oct 2020 00:55:39 -0700 (PDT) From: yulei.kernel@gmail.com X-Google-Original-From: yuleixzhang@tencent.com To: akpm@linux-foundation.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com Cc: linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang , Chen Zhuo Subject: [PATCH 26/35] mm, dmem: introduce pud_special() Date: Thu, 8 Oct 2020 15:54:16 +0800 Message-Id: X-Mailer: git-send-email 2.17.1 In-Reply-To: References: In-Reply-To: References: Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: Yulei Zhang pud_special() will check both _PAGE_SPECIAL and _PAGE_DMEM bit as pmd_special() does. Signed-off-by: Chen Zhuo Signed-off-by: Yulei Zhang --- arch/x86/include/asm/pgtable.h | 13 +++++++++++++ include/linux/pgtable.h | 10 ++++++++++ 2 files changed, 23 insertions(+) diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index e29601cad384..313fb4fd6645 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -282,6 +282,12 @@ static inline int pmd_special(pmd_t pmd) return (pmd_val(pmd) & (_PAGE_SPECIAL | _PAGE_DMEM)) == (_PAGE_SPECIAL | _PAGE_DMEM); } + +static inline int pud_special(pud_t pud) +{ + return (pud_val(pud) & (_PAGE_SPECIAL | _PAGE_DMEM)) == + (_PAGE_SPECIAL | _PAGE_DMEM); +} #endif #ifdef CONFIG_ARCH_HAS_PTE_DEVMAP @@ -517,6 +523,13 @@ static inline pud_t pud_mkdirty(pud_t pud) return pud_set_flags(pud, _PAGE_DIRTY | _PAGE_SOFT_DIRTY); } +#ifdef CONFIG_ARCH_HAS_PTE_DMEM +static inline pud_t pud_mkdmem(pud_t pud) +{ + return pud_set_flags(pud, _PAGE_SPECIAL | _PAGE_DMEM); +} +#endif + static inline pud_t pud_mkdevmap(pud_t pud) { return pud_set_flags(pud, _PAGE_DEVMAP); diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 1fe8546c0a7c..50f27d61f5cd 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -1139,6 +1139,16 @@ static inline int pmd_special(pmd_t pmd) { return 0; } + +static inline pud_t pud_mkdmem(pud_t pud) +{ + return pud; +} + +static inline int pud_special(pud_t pud) +{ + return 0; +} #endif #ifndef pmd_read_atomic From patchwork Thu Oct 8 07:54:17 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11822425 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id BB000109B for ; Thu, 8 Oct 2020 07:56:57 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 86FC92083B for ; Thu, 8 Oct 2020 07:56:57 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass 
(2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="Ymh6vTtx" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729262AbgJHH44 (ORCPT ); Thu, 8 Oct 2020 03:56:56 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52262 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729063AbgJHHzo (ORCPT ); Thu, 8 Oct 2020 03:55:44 -0400 Received: from mail-pf1-x441.google.com (mail-pf1-x441.google.com [IPv6:2607:f8b0:4864:20::441]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 90C43C0613D6; Thu, 8 Oct 2020 00:55:44 -0700 (PDT) Received: by mail-pf1-x441.google.com with SMTP id n14so3322261pff.6; Thu, 08 Oct 2020 00:55:44 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :in-reply-to:references; bh=GLdCsDG8ctFS/IaazURKqVHKZeMbDoGOldSLDYhqodw=; b=Ymh6vTtxX6EemmN/d7QzIlZfSqRo6xlpNvCF2BCqgJvYYG6zd/e3mGU4qBGZQvGCpY D9PSW3uOm4t6AzV+XURBiWNwHdxSD9Rzpxs3XysQe40So8lMStI/7w/v7Xp+ZtB2jEaO YY51s2CRHcObYPHlExH3j0+wiUymftWgaAPlwQRnWh0HJI4cDolGyhXGQmg8RJMf+nS7 SbvPpk7cZWlPHRilIPCZLDiqSIYEqjTmFa20VMcgr8rbv6PcpToXRNsU03t3U1W10reX 9ebdbHfCtYOVM4rlIDrmNyHJJUNoeqtxlJArwf8K12zROQCZ/t8zNUBZhJIdaW6ucXoT COhg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:in-reply-to:references; bh=GLdCsDG8ctFS/IaazURKqVHKZeMbDoGOldSLDYhqodw=; b=ZjnGo3m3qR8sZGqy4l1XtotXWiiMpKNDaciowr1nL95ji+hQWRh6mpOF0TBP6XrhcK 1cFKHvC2tHP2tBmvDpc+YX9c3NISp8rdSLjSY2NGuWMXs2Ln5sP5z451U5UCIteXvsSW FytQBMAD77pHprG6ymmQGCmYOtsYUEdamnFZlArIRL/MCvAPZmk4Gi76aiv8dzF7JqdI t2MwG3NFla3TPXSMem/K8p7ABfldU6j+mWV6v9zxHs3/B2OBq/IEBXxNBNAtsg+EeA0K 63cpxMKMYQbsu7yqQ6exlCOJ/73eAmk4U0fUYCn51tX2EZE/W05NjPBcfHtlMEAZ5/b8 OyCA== X-Gm-Message-State: AOAM5312D/PnoYscKi1f6zviN5WZU7pYdwzC4ONhv8XhBoxGxW+tBWMC FhXaHbiaO+FbVZrIP8iNM2o= X-Google-Smtp-Source: ABdhPJwcSszzr01fRpOkdPyRHaEHglnwh3D28o/3IibTaaYJl4wVZanAVU3UNnnVaH6gB1IjbtlWXQ== X-Received: by 2002:a63:1665:: with SMTP id 37mr6436449pgw.383.1602143744116; Thu, 08 Oct 2020 00:55:44 -0700 (PDT) Received: from localhost.localdomain ([203.205.141.61]) by smtp.gmail.com with ESMTPSA id k206sm6777106pfd.126.2020.10.08.00.55.41 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 08 Oct 2020 00:55:43 -0700 (PDT) From: yulei.kernel@gmail.com X-Google-Original-From: yuleixzhang@tencent.com To: akpm@linux-foundation.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com Cc: linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang , Chen Zhuo Subject: [PATCH 27/35] mm: add pud_special() to support dmem huge pud Date: Thu, 8 Oct 2020 15:54:17 +0800 Message-Id: X-Mailer: git-send-email 2.17.1 In-Reply-To: References: In-Reply-To: References: Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: Yulei Zhang Add pud_special() and follow_special_pud() to support dmem huge pud as we do for dmem huge pmd. 
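For a dmem huge PUD there is no struct page behind the mapping, so the new follow_special_pud() in the diff below can only refuse FOLL_GET/FOLL_DUMP callers and otherwise report "a valid entry exists, but no page" via -EEXIST. The following is a stand-alone user-space sketch of that decision only (toy flag values, not the real FOLL_* constants; the FOLL_TOUCH young/dirty update is left out):

#include <errno.h>
#include <stdio.h>

/* Toy flag values for illustration only -- not the kernel FOLL_* constants. */
#define TOY_FOLL_GET	0x01
#define TOY_FOLL_DUMP	0x02

/*
 * Decision made by follow_special_pud()/follow_special_pmd() for a dmem
 * entry: there is no struct page to pin or dump, so the only successful
 * outcome is "entry exists, no page" (-EEXIST).
 */
static int special_entry_result(unsigned int flags)
{
	if (flags & TOY_FOLL_DUMP)
		return -EFAULT;	/* keep special pages out of core dumps */
	if (flags & TOY_FOLL_GET)
		return -EFAULT;	/* nothing to take a reference on */
	return -EEXIST;		/* valid entry, but no struct page */
}

int main(void)
{
	printf("FOLL_GET:  %d\n", special_entry_result(TOY_FOLL_GET));
	printf("FOLL_DUMP: %d\n", special_entry_result(TOY_FOLL_DUMP));
	printf("plain:     %d\n", special_entry_result(0));
	return 0;
}

The fast-GUP side (gup_huge_pud()) takes the complementary path: it bails out on pud_special() entries so they are handled by the slow path above instead.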
Signed-off-by: Chen Zhuo Signed-off-by: Yulei Zhang --- arch/x86/include/asm/pgtable.h | 2 +- include/linux/huge_mm.h | 2 +- mm/gup.c | 46 ++++++++++++++++++++++++++++++++++ mm/huge_memory.c | 11 +++++--- mm/memory.c | 4 +-- mm/mprotect.c | 2 ++ 6 files changed, 59 insertions(+), 8 deletions(-) diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h index 313fb4fd6645..c9a3b1f79cd5 100644 --- a/arch/x86/include/asm/pgtable.h +++ b/arch/x86/include/asm/pgtable.h @@ -266,7 +266,7 @@ static inline int pmd_trans_huge(pmd_t pmd) #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD static inline int pud_trans_huge(pud_t pud) { - return (pud_val(pud) & (_PAGE_PSE|_PAGE_DEVMAP)) == _PAGE_PSE; + return (pud_val(pud) & (_PAGE_PSE|_PAGE_DEVMAP|_PAGE_DMEM)) == _PAGE_PSE; } #endif diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index b7381e5aafe5..ac8eb3e39575 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -254,7 +254,7 @@ static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd, static inline spinlock_t *pud_trans_huge_lock(pud_t *pud, struct vm_area_struct *vma) { - if (pud_trans_huge(*pud) || pud_devmap(*pud)) + if (pud_trans_huge(*pud) || pud_devmap(*pud) || pud_special(*pud)) return __pud_trans_huge_lock(pud, vma); else return NULL; diff --git a/mm/gup.c b/mm/gup.c index a8edbb6a2b2f..fdcaeb163bc4 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -416,6 +416,42 @@ follow_special_pmd(struct vm_area_struct *vma, unsigned long address, return ERR_PTR(-EEXIST); } +static struct page * +follow_special_pud(struct vm_area_struct *vma, unsigned long address, + pud_t *pud, unsigned int flags) +{ + spinlock_t *ptl; + + if (flags & FOLL_DUMP) + /* Avoid special (like zero) pages in core dumps */ + return ERR_PTR(-EFAULT); + + /* No page to get reference */ + if (flags & FOLL_GET) + return ERR_PTR(-EFAULT); + + if (flags & FOLL_TOUCH) { + pud_t _pud; + + ptl = pud_lock(vma->vm_mm, pud); + if (!pud_special(*pud)) { + spin_unlock(ptl); + return NULL; + } + _pud = pud_mkyoung(*pud); + if (flags & FOLL_WRITE) + _pud = pud_mkdirty(_pud); + if (pudp_set_access_flags(vma, address & HPAGE_PMD_MASK, + pud, _pud, + flags & FOLL_WRITE)) + update_mmu_cache_pud(vma, address, pud); + spin_unlock(ptl); + } + + /* Proper page table entry exists, but no corresponding struct page */ + return ERR_PTR(-EEXIST); +} + /* * FOLL_FORCE can write to even unwritable pte's, but only * after we've gone through a COW cycle and they are dirty. @@ -716,6 +752,12 @@ static struct page *follow_pud_mask(struct vm_area_struct *vma, return page; return no_page_table(vma, flags); } + if (pud_special(*pud)) { + page = follow_special_pud(vma, address, pud, flags); + if (page) + return page; + return no_page_table(vma, flags); + } if (is_hugepd(__hugepd(pud_val(*pud)))) { page = follow_huge_pd(vma, address, __hugepd(pud_val(*pud)), flags, @@ -2478,6 +2520,10 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr, if (!pud_access_permitted(orig, flags & FOLL_WRITE)) return 0; + /* Bypass dmem pud. It will be handled in outside routine. 
*/ + if (pud_special(orig)) + return 0; + if (pud_devmap(orig)) { if (unlikely(flags & FOLL_LONGTERM)) return 0; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index a24601c93713..29e1ab959c90 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -883,6 +883,8 @@ static void insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr, entry = pud_mkhuge(pfn_t_pud(pfn, prot)); if (pfn_t_devmap(pfn)) entry = pud_mkdevmap(entry); + if (pfn_t_dmem(pfn)) + entry = pud_mkdmem(entry); if (write) { entry = pud_mkyoung(pud_mkdirty(entry)); entry = maybe_pud_mkwrite(entry, vma); @@ -919,7 +921,7 @@ vm_fault_t vmf_insert_pfn_pud_prot(struct vm_fault *vmf, pfn_t pfn, * can't support a 'special' bit. */ BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) && - !pfn_t_devmap(pfn)); + !pfn_t_devmap(pfn) && !pfn_t_dmem(pfn)); BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) == (VM_PFNMAP|VM_MIXEDMAP)); BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags)); @@ -1883,7 +1885,7 @@ spinlock_t *__pud_trans_huge_lock(pud_t *pud, struct vm_area_struct *vma) spinlock_t *ptl; ptl = pud_lock(vma->vm_mm, pud); - if (likely(pud_trans_huge(*pud) || pud_devmap(*pud))) + if (likely(pud_trans_huge(*pud) || pud_devmap(*pud) || pud_special(*pud))) return ptl; spin_unlock(ptl); return NULL; @@ -1894,6 +1896,7 @@ int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma, pud_t *pud, unsigned long addr) { spinlock_t *ptl; + pud_t orig_pud; ptl = __pud_trans_huge_lock(pud, vma); if (!ptl) @@ -1904,9 +1907,9 @@ int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma, * pgtable_trans_huge_withdraw after finishing pudp related * operations. */ - pudp_huge_get_and_clear_full(tlb->mm, addr, pud, tlb->fullmm); + orig_pud = pudp_huge_get_and_clear_full(tlb->mm, addr, pud, tlb->fullmm); tlb_remove_pud_tlb_entry(tlb, pud, addr); - if (vma_is_special_huge(vma)) { + if (vma_is_special_huge(vma) || pud_special(orig_pud)) { spin_unlock(ptl); /* No zero page support yet */ } else { diff --git a/mm/memory.c b/mm/memory.c index ca42a6e56e9b..3748fab7cc2a 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -922,7 +922,7 @@ static inline int copy_pud_range(struct mm_struct *dst_mm, struct mm_struct *src src_pud = pud_offset(src_p4d, addr); do { next = pud_addr_end(addr, end); - if (pud_trans_huge(*src_pud) || pud_devmap(*src_pud)) { + if (pud_trans_huge(*src_pud) || pud_devmap(*src_pud) || pud_special(*src_pud)) { int err; VM_BUG_ON_VMA(next-addr != HPAGE_PUD_SIZE, vma); @@ -1215,7 +1215,7 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb, pud = pud_offset(p4d, addr); do { next = pud_addr_end(addr, end); - if (pud_trans_huge(*pud) || pud_devmap(*pud)) { + if (pud_trans_huge(*pud) || pud_devmap(*pud) || pud_special(*pud)) { if (next - addr != HPAGE_PUD_SIZE) { mmap_assert_locked(tlb->mm); split_huge_pud(vma, pud, addr); diff --git a/mm/mprotect.c b/mm/mprotect.c index 36f885cbbb30..cae78c0c5160 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -292,6 +292,8 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma, pud = pud_offset(p4d, addr); do { next = pud_addr_end(addr, end); + if (pud_special(*pud)) + continue; if (pud_none_or_clear_bad(pud)) continue; pages += change_pmd_range(vma, pud, addr, next, newprot, From patchwork Thu Oct 8 07:54:18 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11822401 Return-Path: Received: from mail.kernel.org 
(pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 1B6A0109B for ; Thu, 8 Oct 2020 07:56:23 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id EA48521941 for ; Thu, 8 Oct 2020 07:56:22 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="gozkt5zB" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729077AbgJHH4T (ORCPT ); Thu, 8 Oct 2020 03:56:19 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52290 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729114AbgJHHzs (ORCPT ); Thu, 8 Oct 2020 03:55:48 -0400 Received: from mail-pl1-x643.google.com (mail-pl1-x643.google.com [IPv6:2607:f8b0:4864:20::643]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 9A5F5C061755; Thu, 8 Oct 2020 00:55:48 -0700 (PDT) Received: by mail-pl1-x643.google.com with SMTP id bb1so2375567plb.2; Thu, 08 Oct 2020 00:55:48 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :in-reply-to:references; bh=yi7jGxxMRJTKFHMt31rtyi+qjpDfCVRN9Gvuq5A1fhY=; b=gozkt5zB0DgYVhBidJhV+3vMXvqRNyABsdhKqCq86ORheKAC8rxjkpILlfOy5sDRnh OWypeMNFz+MRE/KvQf0Iep9F+oqDQ0inncfWzr66ewlRDYUsZ+JjIXGnj2TesfBBNo3f +iIHhotwCu5A5fEjoWzcqMf5V9XLrVeKeZy20Vtr1mwPIg9EnFOO1GSSMuwAgDbcBpXv SqgOznz8FarHLZQs6/XLHvurg4ZzWpzfiUraE6vQkRpXon/gtopXQb+9uxA5Dc7jjl1q GCrc+go7/JRvYFlbqAln9NVPmm41R01BYy5Hk6eY/IUnysJ+TsuH/Ut8E05TsVh8l3hC B0DA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:in-reply-to:references; bh=yi7jGxxMRJTKFHMt31rtyi+qjpDfCVRN9Gvuq5A1fhY=; b=hzIs5oAqoBaywUlCAytMwu9NqFinLrpOlxbWp7mt3krmyCzCN4UIFajQEsjnjrJ+D1 CLL/R9SkaJ2m0aAOIla19UXVjaEnmzyyePIg1RzwXg2DF2w1RYP+5G8Xs929L6S5FOLv U9bWfQgpjE0joIvFA4gRz1zy0314qeONUbh59JMAkze7JE3UGXS95HLS9sB7ubqnl++R mgCp+N/uw5fwbz5mVh0VeldbMdjbvn4lE72OVYhqEUP+lGdZFtScdNLG2TrRP7wU51zL Ecr1wrS73Yk1sssRvLxEENxADqrfT6nvOWIZJYXwRDZPqFCuCiU6eFiXT54ZSIKu5ui1 Uobg== X-Gm-Message-State: AOAM530xE5POnMgZuCclHv7spB0UnJ2uzqT4YVufTL68D1wpLtVFsSd+ OOJclY6j/X3nkDlRV1CVotQ= X-Google-Smtp-Source: ABdhPJz089XPEbawgMw+wnqaJGiaMcZofnvQOPXXqAfc3qfL7nSL9Oiv4dpga2EV5DLU23FnwN3xJg== X-Received: by 2002:a17:902:ea8c:b029:d2:8abd:c8de with SMTP id x12-20020a170902ea8cb02900d28abdc8demr6617467plb.21.1602143748229; Thu, 08 Oct 2020 00:55:48 -0700 (PDT) Received: from localhost.localdomain ([203.205.141.61]) by smtp.gmail.com with ESMTPSA id k206sm6777106pfd.126.2020.10.08.00.55.45 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 08 Oct 2020 00:55:47 -0700 (PDT) From: yulei.kernel@gmail.com X-Google-Original-From: yuleixzhang@tencent.com To: akpm@linux-foundation.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com Cc: linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang , Chen Zhuo Subject: [PATCH 28/35] mm, dmemfs: support huge_fault() for dmemfs Date: Thu, 8 Oct 2020 15:54:18 +0800 Message-Id: <4c905a63ed6c68fd23f81e1aafb6a41197a85909.1602093760.git.yuleixzhang@tencent.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: References: In-Reply-To: References: Precedence: bulk List-ID: 
X-Mailing-List: kvm@vger.kernel.org From: Yulei Zhang Introduce __dmemfs_huge_fault() to handle 1G huge pud for dmemfs. Signed-off-by: Chen Zhuo Signed-off-by: Yulei Zhang --- fs/dmemfs/inode.c | 40 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 40 insertions(+) diff --git a/fs/dmemfs/inode.c b/fs/dmemfs/inode.c index 53a9bf214e0d..027428a7f7a0 100644 --- a/fs/dmemfs/inode.c +++ b/fs/dmemfs/inode.c @@ -522,6 +522,43 @@ static vm_fault_t __dmemfs_pmd_fault(struct vm_fault *vmf) return ret; } +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD +static vm_fault_t __dmemfs_huge_fault(struct vm_fault *vmf) +{ + struct vm_area_struct *vma = vmf->vma; + unsigned long pud_addr = vmf->address & PUD_MASK; + struct inode *inode = file_inode(vma->vm_file); + void *entry; + phys_addr_t phys; + pfn_t pfn; + int ret; + + if (dmem_page_size(inode) < PUD_SIZE) + return VM_FAULT_FALLBACK; + + WARN_ON(pud_addr < vma->vm_start || + vma->vm_end < pud_addr + PUD_SIZE); + + entry = radix_get_create_entry(vma, pud_addr, inode, + linear_page_index(vma, pud_addr)); + if (IS_ERR(entry)) + return (PTR_ERR(entry) == -ENOMEM) ? + VM_FAULT_OOM : VM_FAULT_SIGBUS; + + phys = dmem_entry_to_addr(inode, entry); + pfn = phys_to_pfn_t(phys, PFN_DMEM); + ret = vmf_insert_pfn_pud(vmf, pfn, !!(vma->vm_flags & VM_WRITE)); + + radix_put_entry(); + return ret; +} +#else +static vm_fault_t __dmemfs_huge_fault(struct vm_fault *vmf) +{ + return VM_FAULT_FALLBACK; +} +#endif /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */ + static vm_fault_t dmemfs_huge_fault(struct vm_fault *vmf, enum page_entry_size pe_size) { int ret; @@ -533,6 +570,9 @@ static vm_fault_t dmemfs_huge_fault(struct vm_fault *vmf, enum page_entry_size p case PE_SIZE_PMD: ret = __dmemfs_pmd_fault(vmf); break; + case PE_SIZE_PUD: + ret = __dmemfs_huge_fault(vmf); + break; default: ret = VM_FAULT_SIGBUS; } From patchwork Thu Oct 8 07:54:19 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11822391 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 10742109B for ; Thu, 8 Oct 2020 07:56:05 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id A0417206F4 for ; Thu, 8 Oct 2020 07:56:04 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="vd9Dnz8e" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729170AbgJHH4D (ORCPT ); Thu, 8 Oct 2020 03:56:03 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52304 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729131AbgJHHzw (ORCPT ); Thu, 8 Oct 2020 03:55:52 -0400 Received: from mail-pf1-x441.google.com (mail-pf1-x441.google.com [IPv6:2607:f8b0:4864:20::441]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id BED9AC0613D2; Thu, 8 Oct 2020 00:55:52 -0700 (PDT) Received: by mail-pf1-x441.google.com with SMTP id k8so3331909pfk.2; Thu, 08 Oct 2020 00:55:52 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :in-reply-to:references; bh=LyBV6N3gjq3MiLxBfZ3JOHalx3ds96tA29NRwEvbdoI=; b=vd9Dnz8emLxl2V0MxRbRE8ukBDhBRtccJzUehh/zwzrFX58KybIlfgmtQJ04bVTuD+ 
ChU852jZS1a6muWTlmD2z4byGPamv02s2TKdjDlLGQy5EVqnySrqDzcIdw4LVN+Ivd9a P1fICw3poYzu1PShO70s1IICT6lDQRwMQK8uS7ciPRkKvtxfpAARUfgpgu9pGqqv7cEz FOrM4hjxUiagkjqFMEfNipPJK35UQv3aLfUDWT49eQJaUBdHZhoPBuhoMHK3eAzd/KkU 3KSWL1JK2UvxDtZn8qCXmvMK5oaix3Q89m+VHUSKRYDHZKTZBqnCwjmZUkRrP5CnhZ8r LUJw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:in-reply-to:references; bh=LyBV6N3gjq3MiLxBfZ3JOHalx3ds96tA29NRwEvbdoI=; b=iXidjNyXOJamFenCey8QfO9crVNwrEHpOU3htr9351ifcf/PIX9wDBDR8v1r6/jzuq zyphTJLDh3jSe68ZxCle5azr/YwIKe+711/Y7Gi1TuHNNGhKhBzfi1Y3Gzn0mrUTzWv+ qH3T+JRREGnoNkCDPhhpYMq7k/MAJaCXSWoxNJVoeNvhiPeXlLgQV/PcfdWRM4m8/7dH naA3SS7ohkbhlVeDJVxFr9EZo0/bfPH936rmMS2ELIqjUX9n96/R7sAyjlGoBDomLnsY RHAPa64vSEsxrLVh26DBzwB+gHFa6RaH5naQikz9oeAET2zAf5FCYyji+Aw6TL67v76K Dtdg== X-Gm-Message-State: AOAM532BdkJgEQrCyPedwpCMYokAahaik46Vh0R7MZghEErwgvLDgmFQ 7bQN3oXHMUB7dqx1LIWTdAs= X-Google-Smtp-Source: ABdhPJzchyQcPfaxUgFQfXjYWLFWkCzMJoJcb2OICRAl5QnyfvZRb6BniUxZb6dGsJpVC7oewwtwAQ== X-Received: by 2002:a63:d65:: with SMTP id 37mr6412097pgn.139.1602143752415; Thu, 08 Oct 2020 00:55:52 -0700 (PDT) Received: from localhost.localdomain ([203.205.141.61]) by smtp.gmail.com with ESMTPSA id k206sm6777106pfd.126.2020.10.08.00.55.49 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 08 Oct 2020 00:55:51 -0700 (PDT) From: yulei.kernel@gmail.com X-Google-Original-From: yuleixzhang@tencent.com To: akpm@linux-foundation.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com Cc: linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang , Chen Zhuo Subject: [PATCH 29/35] mm: add follow_pte_pud() Date: Thu, 8 Oct 2020 15:54:19 +0800 Message-Id: <71f27b2b03471adbbacccb9a5107b6533a18a957.1602093760.git.yuleixzhang@tencent.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: References: In-Reply-To: References: Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: Yulei Zhang Since we had supported dmem huge pud, here support dmem huge pud for hva_to_pfn(). Similar to follow_pte_pmd(), follow_pte_pud() allows a PTE lead or a huge page PMD or huge page PUD to be found and returned. 
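The follow_pfn() part of the diff below comes down to one piece of arithmetic: for a huge PMD or PUD mapping, the resulting pfn is the huge entry's base pfn plus the 4K-page index of the address inside that huge region. A stand-alone sketch of the calculation, with the usual x86-64 2M/1G sizes hard-coded for illustration:

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT	12
#define PMD_SHIFT	21			/* 2 MiB regions on x86-64 */
#define PUD_SHIFT	30			/* 1 GiB regions on x86-64 */
#define PMD_MASK	(~((1UL << PMD_SHIFT) - 1))
#define PUD_MASK	(~((1UL << PUD_SHIFT) - 1))

/*
 * pfn for 'address' when it is covered by a huge entry whose first pfn is
 * 'base_pfn': add the 4K-page index of the address inside the huge region,
 * the same shape as the pmd/pud branches added to follow_pfn() below.
 */
static unsigned long huge_addr_to_pfn(unsigned long base_pfn,
				      unsigned long address,
				      unsigned long level_mask)
{
	return base_pfn + ((address & ~level_mask) >> PAGE_SHIFT);
}

int main(void)
{
	unsigned long addr = 0x40323000UL;	/* arbitrary example address */

	printf("pmd case: pfn=%#lx\n", huge_addr_to_pfn(0x1000UL, addr, PMD_MASK));
	printf("pud case: pfn=%#lx\n", huge_addr_to_pfn(0x1000UL, addr, PUD_MASK));
	return 0;
}

The same expression appears twice in the patch, once with PMD_MASK and once with PUD_MASK, matching the two new branches in follow_pfn().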
Signed-off-by: Chen Zhuo Signed-off-by: Yulei Zhang --- mm/memory.c | 52 ++++++++++++++++++++++++++++++++++++++++++++-------- 1 file changed, 44 insertions(+), 8 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index 3748fab7cc2a..f831ab4b7ccd 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4535,9 +4535,9 @@ int __pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long address) } #endif /* __PAGETABLE_PMD_FOLDED */ -static int __follow_pte_pmd(struct mm_struct *mm, unsigned long address, +static int __follow_pte_pud(struct mm_struct *mm, unsigned long address, struct mmu_notifier_range *range, - pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp) + pte_t **ptepp, pmd_t **pmdpp, pud_t **pudpp, spinlock_t **ptlp) { pgd_t *pgd; p4d_t *p4d; @@ -4554,6 +4554,26 @@ static int __follow_pte_pmd(struct mm_struct *mm, unsigned long address, goto out; pud = pud_offset(p4d, address); + VM_BUG_ON(pud_trans_huge(*pud)); + if (pud_huge(*pud)) { + if (!pudpp) + goto out; + + if (range) { + mmu_notifier_range_init(range, MMU_NOTIFY_CLEAR, 0, + NULL, mm, address & PUD_MASK, + (address & PUD_MASK) + PUD_SIZE); + mmu_notifier_invalidate_range_start(range); + } + *ptlp = pud_lock(mm, pud); + if (pud_huge(*pud)) { + *pudpp = pud; + return 0; + } + spin_unlock(*ptlp); + if (range) + mmu_notifier_invalidate_range_end(range); + } if (pud_none(*pud) || unlikely(pud_bad(*pud))) goto out; @@ -4609,8 +4629,8 @@ static inline int follow_pte(struct mm_struct *mm, unsigned long address, /* (void) is needed to make gcc happy */ (void) __cond_lock(*ptlp, - !(res = __follow_pte_pmd(mm, address, NULL, - ptepp, NULL, ptlp))); + !(res = __follow_pte_pud(mm, address, NULL, + ptepp, NULL, NULL, ptlp))); return res; } @@ -4622,12 +4642,24 @@ int follow_pte_pmd(struct mm_struct *mm, unsigned long address, /* (void) is needed to make gcc happy */ (void) __cond_lock(*ptlp, - !(res = __follow_pte_pmd(mm, address, range, - ptepp, pmdpp, ptlp))); + !(res = __follow_pte_pud(mm, address, range, + ptepp, pmdpp, NULL, ptlp))); return res; } EXPORT_SYMBOL(follow_pte_pmd); +int follow_pte_pud(struct mm_struct *mm, unsigned long address, + struct mmu_notifier_range *range, + pte_t **ptepp, pmd_t **pmdpp, pud_t **pudpp, spinlock_t **ptlp) +{ + int res; + + /* (void) is needed to make gcc happy */ + (void) __cond_lock(*ptlp, + !(res = __follow_pte_pud(mm, address, range, + ptepp, pmdpp, pudpp, ptlp))); + return res; +} /** * follow_pfn - look up PFN at a user virtual address * @vma: memory mapping @@ -4645,15 +4677,19 @@ int follow_pfn(struct vm_area_struct *vma, unsigned long address, spinlock_t *ptl; pte_t *ptep; pmd_t *pmdp = NULL; + pud_t *pudp = NULL; if (!(vma->vm_flags & (VM_IO | VM_PFNMAP))) return ret; - ret = follow_pte_pmd(vma->vm_mm, address, NULL, &ptep, &pmdp, &ptl); + ret = follow_pte_pud(vma->vm_mm, address, NULL, &ptep, &pmdp, &pudp, &ptl); if (ret) return ret; - if (pmdp) { + if (pudp) { + *pfn = pud_pfn(*pudp) + ((address & ~PUD_MASK) >> PAGE_SHIFT); + spin_unlock(ptl); + } else if (pmdp) { *pfn = pmd_pfn(*pmdp) + ((address & ~PMD_MASK) >> PAGE_SHIFT); spin_unlock(ptl); } else { From patchwork Thu Oct 8 07:54:20 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11822393 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 76DED13B2 for ; Thu, 8 Oct 2020 07:56:14 +0000 (UTC) Received: from vger.kernel.org 
(vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 4463321927 for ; Thu, 8 Oct 2020 07:56:14 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="Qcm+4YHW" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729163AbgJHH4C (ORCPT ); Thu, 8 Oct 2020 03:56:02 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52318 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728181AbgJHHz5 (ORCPT ); Thu, 8 Oct 2020 03:55:57 -0400 Received: from mail-pl1-x642.google.com (mail-pl1-x642.google.com [IPv6:2607:f8b0:4864:20::642]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 1C893C0613D4; Thu, 8 Oct 2020 00:55:57 -0700 (PDT) Received: by mail-pl1-x642.google.com with SMTP id d23so2366196pll.7; Thu, 08 Oct 2020 00:55:57 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :in-reply-to:references; bh=tP/lJJiY+XHS3Tqjt8+lhvljOhGvK52JvoQA9+uEFQk=; b=Qcm+4YHWLy7vd7eZwG3uqWcHRKc0LUHmPsVkrMdKIt8ykEsi4x08XRZNML/b4w/n/e tayHd7M3Pmx1py2aGvGSeA8kcbN/DupSYdHK4IvkQZQElgNehtV2P0wm9y79Kz1R8nDF eX/sftMraCWByoAyLe/cKmm64PaKxMy5imvLYMUggSPH8ivQO4ukPeCarYjvD3+wHDc+ NKNQNYGb0nVAX2mBk5rWM3YLsqnXk38yYBjePF5Ln7w0KNPbkVCY4WV2NJ03GEr2xfDl JDdl8zGW1ApjHpf+BBQOFQ4RJ88g1KGQeaRBKgKNFqI3TTIWz6vsXi5HrFejtIBFDOOq zp1Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:in-reply-to:references; bh=tP/lJJiY+XHS3Tqjt8+lhvljOhGvK52JvoQA9+uEFQk=; b=pfH1ynLoIsO92pzyEmivZRuAjMhiAHi8rd022UOTFCwXv/xBvS4xow8/2nIPgvqCrW iHhD29bOM1aC3gi7VVB4yF9ZycVAcrwSkwGrdO8Osw0FNBX5JPq7szWiAppCPKz5pY/J IJkOqG2UWBd4CxES2aoRRbi8t7pY9GQ7GCu88AFGUitDJvGr6Ks9cnCQwtPK0tUxbjBB hvZsrtUrzWeaNIHht8kYttegt3dT/0oD7hljXJ4t3vsCfAPKjH15gXNC9h2zt0N1d4QI AHy3tROKFt6Bg0Fswh/V9+qgde0/7CzxSOFDN02YH46dzLvkm47YWUnbLVsTboOF1RA9 uH7w== X-Gm-Message-State: AOAM5308gfHcnK8fy8AdtTg60iEFpNLyXjurtYa0jSWprCoBCeb1u3Yx x765OWT4HTqZvUo21xXfH7k= X-Google-Smtp-Source: ABdhPJyjs+IkeIkk5jzJbnSKGDDptAu+x0PXtjw/SvpIVj0BlEbb/SByMNcMW2ExCzt/vVrJMSI1Yw== X-Received: by 2002:a17:902:d888:b029:d0:cb2d:f274 with SMTP id b8-20020a170902d888b02900d0cb2df274mr6280077plz.13.1602143756683; Thu, 08 Oct 2020 00:55:56 -0700 (PDT) Received: from localhost.localdomain ([203.205.141.61]) by smtp.gmail.com with ESMTPSA id k206sm6777106pfd.126.2020.10.08.00.55.53 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 08 Oct 2020 00:55:56 -0700 (PDT) From: yulei.kernel@gmail.com X-Google-Original-From: yuleixzhang@tencent.com To: akpm@linux-foundation.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com Cc: linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang , Chen Zhuo Subject: [PATCH 30/35] dmem: introduce dmem_bitmap_alloc() and dmem_bitmap_free() Date: Thu, 8 Oct 2020 15:54:20 +0800 Message-Id: X-Mailer: git-send-email 2.17.1 In-Reply-To: References: In-Reply-To: References: Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: Yulei Zhang If dmem contained in dmem region is too large and dmemfs is mounted as 4K pagesize, size of bitmap in this dmem region maybe exceed maximal available memory of kzalloc(). It would cause kzalloc() fail. 
So introduce dmem_bitmap_alloc() and use vzalloc() if bitmap is larger than PAGE_SIZE as vzalloc() will get sparse page. Signed-off-by: Chen Zhuo Signed-off-by: Yulei Zhang --- fs/inode.c | 6 ++++ include/linux/fs.h | 1 + mm/dmem.c | 69 +++++++++++++++++++++++++++++----------------- 3 files changed, 50 insertions(+), 26 deletions(-) diff --git a/fs/inode.c b/fs/inode.c index 72c4c347afb7..6f8c60ac9302 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -208,6 +208,12 @@ int inode_init_always(struct super_block *sb, struct inode *inode) } EXPORT_SYMBOL(inode_init_always); +struct inode *alloc_inode_nonrcu(void) +{ + return kmem_cache_alloc(inode_cachep, GFP_KERNEL); +} +EXPORT_SYMBOL(alloc_inode_nonrcu); + void free_inode_nonrcu(struct inode *inode) { kmem_cache_free(inode_cachep, inode); diff --git a/include/linux/fs.h b/include/linux/fs.h index 7519ae003a08..872552dc5a61 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2984,6 +2984,7 @@ extern void clear_inode(struct inode *); extern void __destroy_inode(struct inode *); extern struct inode *new_inode_pseudo(struct super_block *sb); extern struct inode *new_inode(struct super_block *sb); +extern struct inode *alloc_inode_nonrcu(void); extern void free_inode_nonrcu(struct inode *inode); extern int should_remove_suid(struct dentry *); extern int file_remove_privs(struct file *); diff --git a/mm/dmem.c b/mm/dmem.c index eb6df7059cf0..50cdff98675b 100644 --- a/mm/dmem.c +++ b/mm/dmem.c @@ -17,6 +17,7 @@ #include #include #include +#include #define CREATE_TRACE_POINTS #include @@ -362,9 +363,38 @@ static int __init dmem_node_init(struct dmem_node *dnode) return 0; } +static unsigned long *dmem_bitmap_alloc(unsigned long pages, + unsigned long *static_bitmap) +{ + unsigned long *bitmap, size; + + size = BITS_TO_LONGS(pages) * sizeof(long); + if (size <= sizeof(*static_bitmap)) + bitmap = static_bitmap; + else if (size <= PAGE_SIZE) + bitmap = kzalloc(size, GFP_KERNEL); + else + bitmap = vzalloc(size); + + return bitmap; +} + +static void dmem_bitmap_free(unsigned long pages, + unsigned long *bitmap, + unsigned long *static_bitmap) +{ + unsigned long size; + + size = BITS_TO_LONGS(pages) * sizeof(long); + if (size > PAGE_SIZE) + vfree(bitmap); + else if (bitmap != static_bitmap) + kfree(bitmap); +} + static void __init dmem_region_uinit(struct dmem_region *dregion) { - unsigned long nr_pages, size, *bitmap = dregion->error_bitmap; + unsigned long nr_pages, *bitmap = dregion->error_bitmap; if (!bitmap) return; @@ -374,9 +404,7 @@ static void __init dmem_region_uinit(struct dmem_region *dregion) WARN_ON(!nr_pages); - size = BITS_TO_LONGS(nr_pages) * sizeof(long); - if (size > sizeof(dregion->static_bitmap)) - kfree(bitmap); + dmem_bitmap_free(nr_pages, bitmap, &dregion->static_error_bitmap); dregion->error_bitmap = NULL; } @@ -405,19 +433,15 @@ static void __init dmem_uinit(void) static int __init dmem_region_init(struct dmem_region *dregion) { - unsigned long *bitmap, size, nr_pages; + unsigned long *bitmap, nr_pages; nr_pages = __phys_to_pfn(dregion->reserved_end_addr) - __phys_to_pfn(dregion->reserved_start_addr); - size = BITS_TO_LONGS(nr_pages) * sizeof(long); - if (size <= sizeof(dregion->static_error_bitmap)) { - bitmap = &dregion->static_error_bitmap; - } else { - bitmap = kzalloc(size, GFP_KERNEL); - if (!bitmap) - return -ENOMEM; - } + bitmap = dmem_bitmap_alloc(nr_pages, &dregion->static_error_bitmap); + if (!bitmap) + return -ENOMEM; + dregion->error_bitmap = bitmap; return 0; } @@ -472,7 +496,7 @@ late_initcall(dmem_late_init); 
static int dmem_alloc_region_init(struct dmem_region *dregion, unsigned long *dpages) { - unsigned long start, end, *bitmap, size; + unsigned long start, end, *bitmap; start = DMEM_PAGE_UP(dregion->reserved_start_addr); end = DMEM_PAGE_DOWN(dregion->reserved_end_addr); @@ -481,14 +505,9 @@ static int dmem_alloc_region_init(struct dmem_region *dregion, if (!*dpages) return 0; - size = BITS_TO_LONGS(*dpages) * sizeof(long); - if (size <= sizeof(dregion->static_bitmap)) - bitmap = &dregion->static_bitmap; - else { - bitmap = kzalloc(size, GFP_KERNEL); - if (!bitmap) - return -ENOMEM; - } + bitmap = dmem_bitmap_alloc(*dpages, &dregion->static_bitmap); + if (!bitmap) + return -ENOMEM; dregion->bitmap = bitmap; dregion->next_free_pos = 0; @@ -582,7 +601,7 @@ static void dmem_uinit_check_alloc_bitmap(struct dmem_region *dregion) static void dmem_alloc_region_uinit(struct dmem_region *dregion) { - unsigned long dpages, size, *bitmap = dregion->bitmap; + unsigned long dpages, *bitmap = dregion->bitmap; if (!bitmap) return; @@ -592,9 +611,7 @@ static void dmem_alloc_region_uinit(struct dmem_region *dregion) dmem_uinit_check_alloc_bitmap(dregion); - size = BITS_TO_LONGS(dpages) * sizeof(long); - if (size > sizeof(dregion->static_bitmap)) - kfree(bitmap); + dmem_bitmap_free(dpages, bitmap, &dregion->static_bitmap); dregion->bitmap = NULL; } From patchwork Thu Oct 8 07:54:21 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11822397 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 7874D13B2 for ; Thu, 8 Oct 2020 07:56:16 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 56EE121924 for ; Thu, 8 Oct 2020 07:56:16 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="pTfNjEfz" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729154AbgJHH4C (ORCPT ); Thu, 8 Oct 2020 03:56:02 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52330 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729150AbgJHH4B (ORCPT ); Thu, 8 Oct 2020 03:56:01 -0400 Received: from mail-pg1-x544.google.com (mail-pg1-x544.google.com [IPv6:2607:f8b0:4864:20::544]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 6103DC0613D5; Thu, 8 Oct 2020 00:56:01 -0700 (PDT) Received: by mail-pg1-x544.google.com with SMTP id u24so3611565pgi.1; Thu, 08 Oct 2020 00:56:01 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :in-reply-to:references; bh=wyHDR8+dAOff2WNierEioGfQpDDNjWjuRwXu10QUWLc=; b=pTfNjEfzS3Q307/ptExM2s6iDRruR5t8Ppiks1cBPWbNNNJ+wLhGJhIHtyBSoLbZyc nnEpxQuCBkK4/o7AxLI8yK+O4CVJa2uSf1H5QEyimWnyHcKOVsg4WyINDJMcGruyWuyI 1JumNEJoWBljLOBAca3mmBfh3E8NCnqGE6jzHfvvuB0e9E7rLVRW0AMRNjRmLIgrtrMt o8kIXd14KlOXgHpPkOtiJurT73i1+QwYVa4SI8u/CVqxYyeAeqUzXQEzlJmxs23P8ktg aP8I/MWsbb5gBEd2OOLnANzGk50mRv8CLQgSyNyKOH1FUETYZZU3WpsLlDrSmW4UOGuY bbEQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:in-reply-to:references; bh=wyHDR8+dAOff2WNierEioGfQpDDNjWjuRwXu10QUWLc=; 
b=ZjmoSNNJmyjNeA8fAQBLP7HXYxQrosRy7mamG587eGpzedN7ZbRFBrBHc22a2SIHzG RnHsIBU88EQzZKuGtDKGLzsNxcXRWmqP5rse2EyPmRTOn7pv7Fma46HBWRnmkzm7M8XJ F4x0pNmz8RPG27KDuXgYM3CjmswUYLyTgVCF02lFIVO7znF7mc/FVUlRDEXbLPly4KpR 5UuDsntjMQTUyGwW/CFL7tbNfRzf7ZdR3mHy9OAOx2090ENkAYTb8U03Ca2/iQdK3m5i TCeui0I3Mid8/XvMSo16Xq2pZj/T85Z5Q76n2Q47XknHi4E1EM8thUnN9xzaIHgEHw2+ 5kIw== X-Gm-Message-State: AOAM533hoRvGa1OVP0j0DBnNJtiKhcHGziENMpzTiP78Omp5+ZywEjGm LYOqiHc5t0G0htUrhWoMWuw= X-Google-Smtp-Source: ABdhPJyB9HushHJdLOcJQrYX8ceiZBF8Sc+4Il+mcDTyBSIrDeG0ruPy5odkRh7QQaPgij1A/BGp8w== X-Received: by 2002:a63:4f5e:: with SMTP id p30mr6175167pgl.6.1602143760930; Thu, 08 Oct 2020 00:56:00 -0700 (PDT) Received: from localhost.localdomain ([203.205.141.61]) by smtp.gmail.com with ESMTPSA id k206sm6777106pfd.126.2020.10.08.00.55.57 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 08 Oct 2020 00:56:00 -0700 (PDT) From: yulei.kernel@gmail.com X-Google-Original-From: yuleixzhang@tencent.com To: akpm@linux-foundation.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com Cc: linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang , Haiwei Li Subject: [PATCH 31/35] dmem: introduce mce handler Date: Thu, 8 Oct 2020 15:54:21 +0800 Message-Id: <6ac6ec10681d935664d6d065b8464b1a7755b674.1602093760.git.yuleixzhang@tencent.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: References: In-Reply-To: References: Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: Yulei Zhang dmem handle the mce if the pfn belongs to dmem when mce occurs. 1. check whether the pfn is handled by dmem. return if true. 2. mark the pfn in a new error bitmap defined in page. 3. a series of mechanism to ensure that the mce pfn is not allocated. 
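Before the diff, here is a rough, self-contained C model of the flow listed above. It is only an illustration with invented names and a toy 64-page region, not the kernel implementation: the poisoned pfn is recorded in an error bitmap, and the allocation bitmap then tells whether the page was in use (its owners must be signalled) or free (it is simply quarantined so it can never be allocated again).

/* Toy model of the dmem MCE flow: record the bad pfn in an error bitmap,
 * then decide whether it was a free page (mark it allocated so it is never
 * handed out again) or an in-use page (the owner must be told).
 */
#include <stdbool.h>
#include <stdio.h>

#define REGION_PAGES 64                         /* assumption: tiny region */

static unsigned long long error_bitmap;         /* bit n set => pfn n poisoned */
static unsigned long long alloc_bitmap;         /* bit n set => pfn n allocated */

static bool model_memory_failure(unsigned int pfn)
{
        bool was_in_use = false;

        if (pfn >= REGION_PAGES)
                return false;                   /* not dmem: normal MCE path */

        if (error_bitmap & (1ULL << pfn))
                return true;                    /* already recorded */
        error_bitmap |= 1ULL << pfn;

        if (alloc_bitmap & (1ULL << pfn))
                was_in_use = true;              /* in use: signal the owners */
        else
                alloc_bitmap |= 1ULL << pfn;    /* free: quarantine it */

        printf("pfn %u poisoned, %s\n", pfn,
               was_in_use ? "was in use" : "was free, now quarantined");
        return true;
}

int main(void)
{
        alloc_bitmap |= 1ULL << 3;              /* pretend pfn 3 is allocated */
        model_memory_failure(3);                /* in-use page */
        model_memory_failure(10);               /* free page, gets quarantined */
        model_memory_failure(10);               /* duplicate report is ignored */
        return 0;
}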
Signed-off-by: Haiwei Li Signed-off-by: Yulei Zhang --- include/linux/dmem.h | 6 +++ include/trace/events/dmem.h | 17 ++++++ mm/dmem.c | 103 +++++++++++++++++++++++++----------- mm/memory-failure.c | 6 +++ 4 files changed, 102 insertions(+), 30 deletions(-) diff --git a/include/linux/dmem.h b/include/linux/dmem.h index 59d3ef14fe42..cd17a91a7264 100644 --- a/include/linux/dmem.h +++ b/include/linux/dmem.h @@ -21,6 +21,8 @@ dmem_alloc_pages_vma(struct vm_area_struct *vma, unsigned long addr, void dmem_free_pages(phys_addr_t addr, unsigned int dpages_nr); bool is_dmem_pfn(unsigned long pfn); #define dmem_free_page(addr) dmem_free_pages(addr, 1) + +bool dmem_memory_failure(unsigned long pfn, int flags); #else static inline int dmem_reserve_init(void) { @@ -32,5 +34,9 @@ static inline bool is_dmem_pfn(unsigned long pfn) return 0; } +static inline bool dmem_memory_failure(unsigned long pfn, int flags) +{ + return false; +} #endif #endif /* _LINUX_DMEM_H */ diff --git a/include/trace/events/dmem.h b/include/trace/events/dmem.h index 10d1b90a7783..f8eeb3c63b14 100644 --- a/include/trace/events/dmem.h +++ b/include/trace/events/dmem.h @@ -62,6 +62,23 @@ TRACE_EVENT(dmem_free_pages, TP_printk("addr %#lx dpages_nr %d", (unsigned long)__entry->addr, __entry->dpages_nr) ); + +TRACE_EVENT(dmem_memory_failure, + TP_PROTO(unsigned long pfn, bool used), + TP_ARGS(pfn, used), + + TP_STRUCT__entry( + __field(unsigned long, pfn) + __field(bool, used) + ), + + TP_fast_assign( + __entry->pfn = pfn; + __entry->used = used; + ), + + TP_printk("pfn=%#lx used=%d", __entry->pfn, __entry->used) +); #endif /* This part must be outside protection */ diff --git a/mm/dmem.c b/mm/dmem.c index 50cdff98675b..16438dbed3f5 100644 --- a/mm/dmem.c +++ b/mm/dmem.c @@ -431,6 +431,41 @@ static void __init dmem_uinit(void) dmem_pool.registered_pages = 0; } +/* set or clear corresponding bit on allocation bitmap based on error bitmap */ +static unsigned long dregion_alloc_bitmap_set_clear(struct dmem_region *dregion, + bool set) +{ + unsigned long pos_pfn, pos_offset; + unsigned long valid_pages, mce_dpages = 0; + phys_addr_t dpage, reserved_start_pfn; + + reserved_start_pfn = __phys_to_pfn(dregion->reserved_start_addr); + + valid_pages = dpage_to_pfn(dregion->dpage_end_pfn) - reserved_start_pfn; + pos_offset = dpage_to_pfn(dregion->dpage_start_pfn) + - reserved_start_pfn; +try_set: + pos_pfn = find_next_bit(dregion->error_bitmap, valid_pages, pos_offset); + + if (pos_pfn >= valid_pages) + return mce_dpages; + mce_dpages++; + dpage = pfn_to_dpage(pos_pfn + reserved_start_pfn); + if (set) + WARN_ON(__test_and_set_bit(dpage - dregion->dpage_start_pfn, + dregion->bitmap)); + else + WARN_ON(!__test_and_clear_bit(dpage - dregion->dpage_start_pfn, + dregion->bitmap)); + pos_offset = dpage_to_pfn(dpage + 1) - reserved_start_pfn; + goto try_set; +} + +static unsigned long dmem_region_mark_mce_dpages(struct dmem_region *dregion) +{ + return dregion_alloc_bitmap_set_clear(dregion, true); +} + static int __init dmem_region_init(struct dmem_region *dregion) { unsigned long *bitmap, nr_pages; @@ -514,6 +549,8 @@ static int dmem_alloc_region_init(struct dmem_region *dregion, dregion->dpage_start_pfn = start; dregion->dpage_end_pfn = end; + *dpages -= dmem_region_mark_mce_dpages(dregion); + dmem_pool.unaligned_pages += __phys_to_pfn((dpage_to_phys(start) - dregion->reserved_start_addr)); dmem_pool.unaligned_pages += __phys_to_pfn(dregion->reserved_end_addr @@ -558,36 +595,6 @@ dmem_alloc_bitmap_clear(struct dmem_region *dregion, phys_addr_t 
dpage, return err_num; } -/* set or clear corresponding bit on allocation bitmap based on error bitmap */ -static unsigned long dregion_alloc_bitmap_set_clear(struct dmem_region *dregion, - bool set) -{ - unsigned long pos_pfn, pos_offset; - unsigned long valid_pages, mce_dpages = 0; - phys_addr_t dpage, reserved_start_pfn; - - reserved_start_pfn = __phys_to_pfn(dregion->reserved_start_addr); - - valid_pages = dpage_to_pfn(dregion->dpage_end_pfn) - reserved_start_pfn; - pos_offset = dpage_to_pfn(dregion->dpage_start_pfn) - - reserved_start_pfn; -try_set: - pos_pfn = find_next_bit(dregion->error_bitmap, valid_pages, pos_offset); - - if (pos_pfn >= valid_pages) - return mce_dpages; - mce_dpages++; - dpage = pfn_to_dpage(pos_pfn + reserved_start_pfn); - if (set) - WARN_ON(__test_and_set_bit(dpage - dregion->dpage_start_pfn, - dregion->bitmap)); - else - WARN_ON(!__test_and_clear_bit(dpage - dregion->dpage_start_pfn, - dregion->bitmap)); - pos_offset = dpage_to_pfn(dpage + 1) - reserved_start_pfn; - goto try_set; -} - static void dmem_uinit_check_alloc_bitmap(struct dmem_region *dregion) { unsigned long dpages, size; @@ -989,6 +996,42 @@ void dmem_free_pages(phys_addr_t addr, unsigned int dpages_nr) } EXPORT_SYMBOL(dmem_free_pages); +bool dmem_memory_failure(unsigned long pfn, int flags) +{ + struct dmem_region *dregion; + struct dmem_node *pdnode = NULL; + u64 pos; + phys_addr_t addr = __pfn_to_phys(pfn); + bool used = false; + + dregion = find_dmem_region(addr, &pdnode); + if (!dregion) + return false; + + WARN_ON(!pdnode || !dregion->error_bitmap); + + mutex_lock(&dmem_pool.lock); + pos = pfn - __phys_to_pfn(dregion->reserved_start_addr); + if (__test_and_set_bit(pos, dregion->error_bitmap)) + goto out; + + if (!dregion->bitmap || pfn < dpage_to_pfn(dregion->dpage_start_pfn) || + pfn >= dpage_to_pfn(dregion->dpage_end_pfn)) + goto out; + + pos = phys_to_dpage(addr) - dregion->dpage_start_pfn; + if (__test_and_set_bit(pos, dregion->bitmap)) { + used = true; + } else { + pr_info("MCE: free dpage, mark %#lx disabled in dmem\n", pfn); + dnode_count_free_dpages(pdnode, -1); + } +out: + trace_dmem_memory_failure(pfn, used); + mutex_unlock(&dmem_pool.lock); + return true; +} + bool is_dmem_pfn(unsigned long pfn) { struct dmem_node *dnode; diff --git a/mm/memory-failure.c b/mm/memory-failure.c index f1aa6433f404..c613e1ec5995 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -35,6 +35,7 @@ */ #include #include +#include #include #include #include @@ -1280,6 +1281,11 @@ int memory_failure(unsigned long pfn, int flags) if (!sysctl_memory_failure_recovery) panic("Memory failure on page %lx", pfn); + if (dmem_memory_failure(pfn, flags)) { + pr_info("MCE %#lx: handled by dmem\n", pfn); + return 0; + } + p = pfn_to_online_page(pfn); if (!p) { if (pfn_valid(pfn)) { From patchwork Thu Oct 8 07:54:22 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11822419 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 6333C13B2 for ; Thu, 8 Oct 2020 07:56:51 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 2D61721924 for ; Thu, 8 Oct 2020 07:56:51 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="g/c1RfyQ" Received: (majordomo@vger.kernel.org) by 
vger.kernel.org via listexpand id S1729114AbgJHH4s (ORCPT ); Thu, 8 Oct 2020 03:56:48 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52346 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729181AbgJHH4F (ORCPT ); Thu, 8 Oct 2020 03:56:05 -0400 Received: from mail-pf1-x443.google.com (mail-pf1-x443.google.com [IPv6:2607:f8b0:4864:20::443]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 97AB7C0613D4; Thu, 8 Oct 2020 00:56:05 -0700 (PDT) Received: by mail-pf1-x443.google.com with SMTP id q123so3341111pfb.0; Thu, 08 Oct 2020 00:56:05 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :in-reply-to:references; bh=OLe1dASqSvtsHG3v+61p4SLPbRY7HjOgB0mEYZxCxAI=; b=g/c1RfyQfTctEsdD5SmnXHkCDGzHjroOi/DbxmEUpnAITeK/LPW9KZUJoO7mkHb/y3 mZtJq0grl1emEQuk4pcrdm8OLAAwsH/nrsbhZoNTPRh0LlFK7HCzfRZWSVNaI54cEOcE M1lfmHdISf1mPEALStYpamvx3k1KXvl8GLSQam2QbhdT0JWmHUHUt9BurHxCLfmkh2wf aXO9ZztwfHK4M+0Z07UbjgTyQZPUqSBpDG+0WC1YMz9zNqERcVrbQmoSLNOE7upNCrge SgSpy6mRrMu9d7t4PEmb3cmIMipsXP5B8o7EX4cv5vymKl66jE1lFSuDEbGV56wjYt3s RU4w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:in-reply-to:references; bh=OLe1dASqSvtsHG3v+61p4SLPbRY7HjOgB0mEYZxCxAI=; b=MelqnQvUEfH+86fFQrOTs/5/66ng9LD7YKLkRtbe7F0DFZ9TAPCJymlAQu1esD2J8r u2/HWOR5GpSQjj86lTpVD1BrWcKb8eYjIbYKglthaMy86/RFLrjet5uzqBiDbQMTHChw ncP7YRBdA5ij2dWCe2j8eSqRjtOtQnwi3glzSS6s4liJFPuHOruX4vf51UWBWnqAavN8 XSHNg5P/7yhQCcIjNQLpG00tyh1PWC4SAFefnktMCY14YD/CjvN5ERsJiFtz3rDTQVbw ZmchfF6Z5TwjDu0cx7YAGMmo0YQOZermgXhScvj7dMXY2pEmwXmG0iCgDcXvCG4n6QnK T1rA== X-Gm-Message-State: AOAM533R8vxt/uuyhyuEMqd8dl7dow16fyllEkUvgdOd2NYY0D6wxjK4 mPcpC8btDIE0B5dq7dj8s9s= X-Google-Smtp-Source: ABdhPJzSs0Zw1OPxv0SQRoQiZPP8wljYqksUhWu5P9q85AvyJ4tuA++0Bz7VMPtsz9l1RRQPdhaV3A== X-Received: by 2002:a17:90a:a88:: with SMTP id 8mr6847557pjw.105.1602143765132; Thu, 08 Oct 2020 00:56:05 -0700 (PDT) Received: from localhost.localdomain ([203.205.141.61]) by smtp.gmail.com with ESMTPSA id k206sm6777106pfd.126.2020.10.08.00.56.02 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 08 Oct 2020 00:56:04 -0700 (PDT) From: yulei.kernel@gmail.com X-Google-Original-From: yuleixzhang@tencent.com To: akpm@linux-foundation.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com Cc: linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang , Haiwei Li Subject: [PATCH 32/35] mm, dmemfs: register and handle the dmem mce Date: Thu, 8 Oct 2020 15:54:22 +0800 Message-Id: <39d94616fd92ab9a1bb989c928783a7060df1fc7.1602093760.git.yuleixzhang@tencent.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: References: In-Reply-To: References: Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: Yulei Zhang dmemfs register the mce handler, send signal to the procs whose vma is mapped in mce pfn. 
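The registration/callback structure this patch relies on can be pictured with the small userspace C sketch below. It is a simplified stand-in, not the kernel's notifier_block API: dmem keeps a chain of callbacks, dmemfs registers one at module init, and when dmem_memory_failure() hits an in-use page the chain is walked so dmemfs can map the pfn back to an inode and signal the processes mapping it.

/* Simplified model of the dmem -> dmemfs MCE notification path. */
#include <stdio.h>

#define MAX_NOTIFIERS 4

typedef int (*mce_notifier_fn)(unsigned long pfn, int flags);

static mce_notifier_fn chain[MAX_NOTIFIERS];
static int chain_len;

static int register_mce_notifier(mce_notifier_fn fn)
{
        if (chain_len >= MAX_NOTIFIERS)
                return -1;
        chain[chain_len++] = fn;
        return 0;
}

static void notify_mce(unsigned long pfn, int flags)
{
        /* dmem would call this for a poisoned pfn found to be in use */
        for (int i = 0; i < chain_len; i++)
                chain[i](pfn, flags);
}

/* dmemfs side: map the pfn back to a file offset and "signal" the users */
static int dmemfs_mce_handler(unsigned long pfn, int flags)
{
        printf("dmemfs: pfn %#lx poisoned (flags %#x), look up inode and signal mappers\n",
               pfn, flags);
        return 0;
}

int main(void)
{
        register_mce_notifier(dmemfs_mce_handler);      /* done at module init */
        notify_mce(0x12345, 0);                         /* simulated MCE on a used dpage */
        return 0;
}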
Signed-off-by: Haiwei Li Signed-off-by: Yulei Zhang --- fs/dmemfs/inode.c | 141 +++++++++++++++++++++++++++++++++++++++++++ include/linux/dmem.h | 7 +++ include/linux/mm.h | 2 + mm/dmem.c | 34 +++++++++++ mm/memory-failure.c | 63 ++++++++++++++----- 5 files changed, 231 insertions(+), 16 deletions(-) diff --git a/fs/dmemfs/inode.c b/fs/dmemfs/inode.c index 027428a7f7a0..adfceff98636 100644 --- a/fs/dmemfs/inode.c +++ b/fs/dmemfs/inode.c @@ -36,6 +36,47 @@ MODULE_LICENSE("GPL v2"); static uint __read_mostly max_alloc_try_dpages = 1; +struct dmemfs_inode { + struct inode *inode; + struct list_head link; +}; + +static LIST_HEAD(dmemfs_inode_list); +static DEFINE_SPINLOCK(dmemfs_inode_lock); + +static struct dmemfs_inode * +dmemfs_create_dmemfs_inode(struct inode *inode) +{ + struct dmemfs_inode *dmemfs_inode; + + spin_lock(&dmemfs_inode_lock); + dmemfs_inode = kmalloc(sizeof(struct dmemfs_inode), GFP_NOIO); + if (!dmemfs_inode) { + pr_err("DMEMFS: Out of memory while getting dmemfs inode\n"); + goto out; + } + dmemfs_inode->inode = inode; + list_add_tail(&dmemfs_inode->link, &dmemfs_inode_list); +out: + spin_unlock(&dmemfs_inode_lock); + return dmemfs_inode; +} + +static void dmemfs_delete_dmemfs_inode(struct inode *inode) +{ + struct dmemfs_inode *i, *next; + + spin_lock(&dmemfs_inode_lock); + list_for_each_entry_safe(i, next, &dmemfs_inode_list, link) { + if (i->inode == inode) { + list_del(&i->link); + kfree(i); + break; + } + } + spin_unlock(&dmemfs_inode_lock); +} + struct dmemfs_mount_opts { unsigned long dpage_size; }; @@ -221,6 +262,13 @@ static unsigned long dmem_pgoff_to_index(struct inode *inode, pgoff_t pgoff) return pgoff >> (sb->s_blocksize_bits - PAGE_SHIFT); } +static pgoff_t dmem_index_to_pgoff(struct inode *inode, unsigned long index) +{ + struct super_block *sb = inode->i_sb; + + return index << (sb->s_blocksize_bits - PAGE_SHIFT); +} + static void *dmem_addr_to_entry(struct inode *inode, phys_addr_t addr) { struct super_block *sb = inode->i_sb; @@ -809,6 +857,23 @@ static void dmemfs_evict_inode(struct inode *inode) clear_inode(inode); } +static struct inode *dmemfs_alloc_inode(struct super_block *sb) +{ + struct inode *inode; + + inode = alloc_inode_nonrcu(); + if (inode) + dmemfs_create_dmemfs_inode(inode); + return inode; +} + +static void dmemfs_destroy_inode(struct inode *inode) +{ + if (inode) + dmemfs_delete_dmemfs_inode(inode); + free_inode_nonrcu(inode); +} + /* * Display the mount options in /proc/mounts. 
*/ @@ -822,9 +887,11 @@ static int dmemfs_show_options(struct seq_file *m, struct dentry *root) } static const struct super_operations dmemfs_ops = { + .alloc_inode = dmemfs_alloc_inode, .statfs = dmemfs_statfs, .evict_inode = dmemfs_evict_inode, .drop_inode = generic_delete_inode, + .destroy_inode = dmemfs_destroy_inode, .show_options = dmemfs_show_options, }; @@ -904,17 +971,91 @@ static struct file_system_type dmemfs_fs_type = { .kill_sb = dmemfs_kill_sb, }; +static struct inode * +dmemfs_find_inode_by_addr(phys_addr_t addr, pgoff_t *pgoff) +{ + struct dmemfs_inode *di; + struct inode *inode; + struct address_space *mapping; + void *entry, **slot; + void *mce_entry; + + list_for_each_entry(di, &dmemfs_inode_list, link) { + inode = di->inode; + mapping = inode->i_mapping; + mce_entry = dmem_addr_to_entry(inode, addr); + XA_STATE(xas, &mapping->i_pages, 0); + rcu_read_lock(); + + xas_for_each(&xas, entry, ULONG_MAX) { + if (xas_retry(&xas, entry)) + continue; + + if (unlikely(entry != xas_reload(&xas))) + goto retry; + + if (mce_entry != entry) + continue; + *pgoff = dmem_index_to_pgoff(inode, xas.xa_index); + rcu_read_unlock(); + return inode; +retry: + xas_reset(&xas); + } + rcu_read_unlock(); + } + return NULL; +} + +static int dmemfs_mce_handler(struct notifier_block *this, unsigned long pfn, + void *v) +{ + struct dmem_mce_notifier_info *info = + (struct dmem_mce_notifier_info *)v; + int flags = info->flags; + struct inode *inode; + phys_addr_t mce_addr = __pfn_to_phys(pfn); + pgoff_t pgoff; + + spin_lock(&dmemfs_inode_lock); + inode = dmemfs_find_inode_by_addr(mce_addr, &pgoff); + if (!inode || !atomic_read(&inode->i_count)) + goto out; + + collect_procs_and_signal_inode(inode, pgoff, pfn, flags); +out: + spin_unlock(&dmemfs_inode_lock); + return 0; +} + +static struct notifier_block dmemfs_mce_notifier = { + .notifier_call = dmemfs_mce_handler, +}; + static int __init dmemfs_init(void) { int ret; + pr_info("dmemfs initialized\n"); ret = register_filesystem(&dmemfs_fs_type); + if (ret) + goto reg_fs_fail; + + ret = dmem_register_mce_notifier(&dmemfs_mce_notifier); + if (ret) + goto reg_notifier_fail; + return 0; + +reg_notifier_fail: + unregister_filesystem(&dmemfs_fs_type); +reg_fs_fail: return ret; } static void __exit dmemfs_uninit(void) { + dmem_unregister_mce_notifier(&dmemfs_mce_notifier); unregister_filesystem(&dmemfs_fs_type); } diff --git a/include/linux/dmem.h b/include/linux/dmem.h index cd17a91a7264..fe0b270ef1e5 100644 --- a/include/linux/dmem.h +++ b/include/linux/dmem.h @@ -23,6 +23,13 @@ bool is_dmem_pfn(unsigned long pfn); #define dmem_free_page(addr) dmem_free_pages(addr, 1) bool dmem_memory_failure(unsigned long pfn, int flags); + +struct dmem_mce_notifier_info { + int flags; +}; + +int dmem_register_mce_notifier(struct notifier_block *nb); +int dmem_unregister_mce_notifier(struct notifier_block *nb); #else static inline int dmem_reserve_init(void) { diff --git a/include/linux/mm.h b/include/linux/mm.h index 7b1e574d2387..ff0b12320ca1 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -3006,6 +3006,8 @@ extern int memory_failure(unsigned long pfn, int flags); extern void memory_failure_queue(unsigned long pfn, int flags); extern void memory_failure_queue_kick(int cpu); extern int unpoison_memory(unsigned long pfn); +extern void collect_procs_and_signal_inode(struct inode *inode, pgoff_t pgoff, + unsigned long pfn, int flags); extern int get_hwpoison_page(struct page *page); #define put_hwpoison_page(page) put_page(page) extern int 
sysctl_memory_failure_early_kill; diff --git a/mm/dmem.c b/mm/dmem.c index 16438dbed3f5..dd81b2483696 100644 --- a/mm/dmem.c +++ b/mm/dmem.c @@ -70,6 +70,7 @@ struct dmem_node { struct dmem_pool { struct mutex lock; + struct raw_notifier_head mce_notifier_chain; unsigned long region_num; unsigned long registered_pages; @@ -92,6 +93,7 @@ struct dmem_pool { static struct dmem_pool dmem_pool = { .lock = __MUTEX_INITIALIZER(dmem_pool.lock), + .mce_notifier_chain = RAW_NOTIFIER_INIT(dmem_pool.mce_notifier_chain), }; #define DMEM_PAGE_SIZE (1UL << dmem_pool.dpage_shift) @@ -121,6 +123,35 @@ static struct dmem_pool dmem_pool = { #define for_each_dmem_region(_dnode, _dregion) \ list_for_each_entry(_dregion, &(_dnode)->regions, node) +int dmem_register_mce_notifier(struct notifier_block *nb) +{ + int ret; + + mutex_lock(&dmem_pool.lock); + ret = raw_notifier_chain_register(&dmem_pool.mce_notifier_chain, nb); + mutex_unlock(&dmem_pool.lock); + return ret; +} +EXPORT_SYMBOL(dmem_register_mce_notifier); + +int dmem_unregister_mce_notifier(struct notifier_block *nb) +{ + int ret; + + mutex_lock(&dmem_pool.lock); + ret = raw_notifier_chain_unregister(&dmem_pool.mce_notifier_chain, nb); + mutex_unlock(&dmem_pool.lock); + return ret; +} +EXPORT_SYMBOL(dmem_unregister_mce_notifier); + +static int dmem_mce_notify(unsigned long pfn, + struct dmem_mce_notifier_info *info) +{ + return raw_notifier_call_chain(&dmem_pool.mce_notifier_chain, + pfn, info); +} + static inline int *dmem_nodelist(int nid) { return nid_to_dnode(nid)->nodelist; @@ -1003,6 +1034,7 @@ bool dmem_memory_failure(unsigned long pfn, int flags) u64 pos; phys_addr_t addr = __pfn_to_phys(pfn); bool used = false; + struct dmem_mce_notifier_info info; dregion = find_dmem_region(addr, &pdnode); if (!dregion) @@ -1022,6 +1054,8 @@ bool dmem_memory_failure(unsigned long pfn, int flags) pos = phys_to_dpage(addr) - dregion->dpage_start_pfn; if (__test_and_set_bit(pos, dregion->bitmap)) { used = true; + info.flags = flags; + dmem_mce_notify(pfn, &info); } else { pr_info("MCE: free dpage, mark %#lx disabled in dmem\n", pfn); dnode_count_free_dpages(pdnode, -1); diff --git a/mm/memory-failure.c b/mm/memory-failure.c index c613e1ec5995..cdd3cd77edbc 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -307,8 +307,8 @@ static unsigned long dev_pagemap_mapping_shift(struct page *page, * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM. */ static void add_to_kill(struct task_struct *tsk, struct page *p, - struct vm_area_struct *vma, - struct list_head *to_kill) + struct vm_area_struct *vma, unsigned long pfn, + pgoff_t pgoff, struct list_head *to_kill) { struct to_kill *tk; @@ -318,12 +318,17 @@ static void add_to_kill(struct task_struct *tsk, struct page *p, return; } - tk->addr = page_address_in_vma(p, vma); - if (is_zone_device_page(p)) - tk->size_shift = dev_pagemap_mapping_shift(p, vma); - else - tk->size_shift = page_shift(compound_head(p)); - + if (p) { + tk->addr = page_address_in_vma(p, vma); + if (is_zone_device_page(p)) + tk->size_shift = dev_pagemap_mapping_shift(p, vma); + else + tk->size_shift = page_shift(compound_head(p)); + } else { + tk->size_shift = PAGE_SHIFT; + tk->addr = vma->vm_start + + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT); + } /* * Send SIGKILL if "tk->addr == -EFAULT". 
Also, as * "tk->size_shift" is always non-zero for !is_zone_device_page(), @@ -336,7 +341,7 @@ static void add_to_kill(struct task_struct *tsk, struct page *p, */ if (tk->addr == -EFAULT) { pr_info("Memory failure: Unable to find user space address %lx in %s\n", - page_to_pfn(p), tsk->comm); + pfn, tsk->comm); } else if (tk->size_shift == 0) { kfree(tk); return; @@ -469,7 +474,8 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill, if (!page_mapped_in_vma(page, vma)) continue; if (vma->vm_mm == t->mm) - add_to_kill(t, page, vma, to_kill); + add_to_kill(t, page, vma, page_to_pfn(page), + page_to_pgoff(page), to_kill); } } read_unlock(&tasklist_lock); @@ -477,19 +483,18 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill, } /* - * Collect processes when the error hit a file mapped page. + * Collect processes when the error hit a file mapped memory. */ -static void collect_procs_file(struct page *page, struct list_head *to_kill, - int force_early) +static void __collect_procs_file(struct address_space *mapping, pgoff_t pgoff, + struct page *page, unsigned long pfn, + struct list_head *to_kill, int force_early) { struct vm_area_struct *vma; struct task_struct *tsk; - struct address_space *mapping = page->mapping; i_mmap_lock_read(mapping); read_lock(&tasklist_lock); for_each_process(tsk) { - pgoff_t pgoff = page_to_pgoff(page); struct task_struct *t = task_early_kill(tsk, force_early); if (!t) @@ -504,13 +509,39 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill, * to be informed of all such data corruptions. */ if (vma->vm_mm == t->mm) - add_to_kill(t, page, vma, to_kill); + add_to_kill(t, page, vma, pfn, pgoff, to_kill); } } read_unlock(&tasklist_lock); i_mmap_unlock_read(mapping); } +/* + * Collect processes when the error hit a file mapped page. + */ +static void collect_procs_file(struct page *page, struct list_head *to_kill, + int force_early) +{ + struct address_space *mapping = page->mapping; + + __collect_procs_file(mapping, page_to_pgoff(page), page, + page_to_pfn(page), to_kill, force_early); +} + +void collect_procs_and_signal_inode(struct inode *inode, pgoff_t pgoff, + unsigned long pfn, int flags) +{ + int forcekill; + struct address_space *mapping = &inode->i_data; + LIST_HEAD(tokill); + + __collect_procs_file(mapping, pgoff, NULL, pfn, &tokill, + flags & MF_ACTION_REQUIRED); + forcekill = flags & MF_MUST_KILL; + kill_procs(&tokill, forcekill, false, pfn, flags); +} +EXPORT_SYMBOL(collect_procs_and_signal_inode); + /* * Collect the processes who have the corrupted page mapped to kill. 
*/ From patchwork Thu Oct 8 07:54:23 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11822415 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id B250413B2 for ; Thu, 8 Oct 2020 07:56:43 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 892C621924 for ; Thu, 8 Oct 2020 07:56:43 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="ahiMLnOI" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729123AbgJHH41 (ORCPT ); Thu, 8 Oct 2020 03:56:27 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52360 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729189AbgJHH4K (ORCPT ); Thu, 8 Oct 2020 03:56:10 -0400 Received: from mail-pl1-x643.google.com (mail-pl1-x643.google.com [IPv6:2607:f8b0:4864:20::643]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id F003FC0613D5; Thu, 8 Oct 2020 00:56:09 -0700 (PDT) Received: by mail-pl1-x643.google.com with SMTP id y20so2352412pll.12; Thu, 08 Oct 2020 00:56:09 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :in-reply-to:references; bh=SHquDGzVelfsjqJkv/6UwKJoMQiz81lYZE6AtoNA58c=; b=ahiMLnOI29q8bCutF5aZx2p+0pSJS8sKbTXkmJLjGHz1ZrpVDJnUB7Z1xx6vmuQ7dX IAMpe242UlgNgGtsdN1aPLiBIE02P8EM7bb5PrYnyEydUwtxf/IyZwqIaKMywamKJtdo 1mCGIXwTbZQZs7bApmbBJrED7G55v5hT5u/oKzEn5d8vUMUyqyYai0WoRGrYNfarfUCI xNGKfTpco5fo66xN4o18GluAGNX9/reuTwavH/4f7dao5pyzIaKFVYHlrOBCoPdGCWeL B6Hx+EbLkm+r5hion2bKa39w72uf7GrOGLJFMY1WipZgrhz6LMWrQbp4JdGIFEmLrZoS p5fw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:in-reply-to:references; bh=SHquDGzVelfsjqJkv/6UwKJoMQiz81lYZE6AtoNA58c=; b=Ztf/9wfVpw0yiLILE2pNKCZHoKbSkmeRhRR1mfk0eX+loot7A2D+6hOx8qqx/FNJBi Grw6wxjAtmG9qMRRlwSTWAev78qVq9fXQFErPnnCm71JqEIAqQpW3RjyX2ufAfJuG8Vd Q4Wb4motShuCTTVs6E6bu50fxE46sDmigWoCaj4SJw32g+2po4UCvgXtf/By5aqzmL8O OM0C83XIZ+n0R5MKJa3AxiYLNaT2ewyA84m8yv6+ddsgAK/b1u+IDT0x71uzhrFA+ecf r1HFpA2HNBqsPDcm6X8NCl1fQRbiMxd6buyotGiu6iXxRPyQhxR4ZyoZPhjYo1riOzgC pcqQ== X-Gm-Message-State: AOAM532+U2j/0u4ihA0OjrqphzYm0VlctKfIolh4q7pDc+IIURERH1yJ fgtnObbT6vmVxiyB+rrJPMg= X-Google-Smtp-Source: ABdhPJwAwTpZv8MKyuuOm5FecRpKtxAGzN7uciTS5CimXac0Dl8WRCaZm+3psM+6v5QmKjwCJ8SwDg== X-Received: by 2002:a17:902:d710:b029:d3:7e54:96d8 with SMTP id w16-20020a170902d710b02900d37e5496d8mr6544962ply.65.1602143769597; Thu, 08 Oct 2020 00:56:09 -0700 (PDT) Received: from localhost.localdomain ([203.205.141.61]) by smtp.gmail.com with ESMTPSA id k206sm6777106pfd.126.2020.10.08.00.56.06 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 08 Oct 2020 00:56:09 -0700 (PDT) From: yulei.kernel@gmail.com X-Google-Original-From: yuleixzhang@tencent.com To: akpm@linux-foundation.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com Cc: linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang Subject: [PATCH 33/35] kvm, x86: temporary disable record_steal_time for dmem Date: 
Thu, 8 Oct 2020 15:54:23 +0800 Message-Id: X-Mailer: git-send-email 2.17.1 In-Reply-To: References: In-Reply-To: References: Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: Yulei Zhang Temporarily disable record_steal_time when entering the guest for dmem. Signed-off-by: Yulei Zhang --- arch/x86/kvm/x86.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 1994602a0851..409b5a68aa60 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -2789,6 +2789,7 @@ static void kvm_vcpu_flush_tlb_guest(struct kvm_vcpu *vcpu) static void record_steal_time(struct kvm_vcpu *vcpu) { +#if 0 struct kvm_host_map map; struct kvm_steal_time *st; @@ -2830,6 +2831,7 @@ static void record_steal_time(struct kvm_vcpu *vcpu) st->version += 1; kvm_unmap_gfn(vcpu, &map, &vcpu->arch.st.cache, true, false); +#endif } int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info) From patchwork Thu Oct 8 07:54:24 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11822407 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 0AD8E109B for ; Thu, 8 Oct 2020 07:56:35 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id CC9B621924 for ; Thu, 8 Oct 2020 07:56:34 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="K1aaLzfU" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729240AbgJHH4d (ORCPT ); Thu, 8 Oct 2020 03:56:33 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52412 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729213AbgJHH41 (ORCPT ); Thu, 8 Oct 2020 03:56:27 -0400 Received: from mail-pg1-x544.google.com (mail-pg1-x544.google.com [IPv6:2607:f8b0:4864:20::544]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 51000C0613D6; Thu, 8 Oct 2020 00:56:14 -0700 (PDT) Received: by mail-pg1-x544.google.com with SMTP id 7so3568212pgm.11; Thu, 08 Oct 2020 00:56:14 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :in-reply-to:references; bh=2lAuZY+zAFWdjZI2zWvNi356b2sKzGmvUf75gafHJXM=; b=K1aaLzfUCrptJTIAZl158Rd5hRyDuRIyJIxBl985m5sCESH2tj90eMvSsReceCA6TR Yeyc2NMcNS/zJolMIG1kJT3wKshzQqVuUdGGJTPr6SjInjxMGVRnN/g/7Y7N4PDliwBu zPMhbYUtSsWz6J1SEKgK/pQiFKgUDPedA7LnRznIfSlLOvGgEGerKYsLpgXfLaxqGFis f5/LQRHmY4+a1CySWvymELU0l1FGeGoXgBY8X//z25l8HAZfXudRBwYrv2W5+QT/0xwt sMQ649AaGUIuTVfiHb35yR2/aWNfvvlyGZ1tn0tqP4+JwixwQnVjPRZTadCw2d863DFZ xqdQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:in-reply-to:references; bh=2lAuZY+zAFWdjZI2zWvNi356b2sKzGmvUf75gafHJXM=; b=r0H6uryk1WUVo/OprHIT0zwH6G/kz/uul8t7Ld/P/GOnPn3GvNH7sgVSxVPqO3/tRv dH9pjT/BdBslPdq4P+8uQUNFrqyL3VXiUrmqFviwGsqOljEOtumFagEl7MwovcguFfsM 44rxzwa+LhjBOZxKB99JZKozg49rW4E6ZKFReNoU45Md7eOaOyos2Y1/uLP2v6dBuSOj J93wIeOGAPFTwzac6tREB7XPYfFJa+HKWkaJmML04hA6RT3EX48634ou90Kfitc2q558 vFdZZrg3ERerU5ZGeGrDtPDb4vFGgIHAFcU18d0IIGgy8yJdZR9G0a+ImIsVMQAwg4u2 eE3A== X-Gm-Message-State: AOAM532O50aHMwmEfOcO5vMzru3H0EKGDTruRK1RWOudXUTXxMXZKNv3 
oCe4vNiACmCsvLeZ2a+eQXWxnHlLYZ0CNw== X-Google-Smtp-Source: ABdhPJzKACFEvYxYVvHX+wvfLK7h21I4Puwri8PgdyyDOKDlutYOAq9/UWwxrpOiqNKwCL2xhAv1aw== X-Received: by 2002:a63:2145:: with SMTP id s5mr5994447pgm.288.1602143773888; Thu, 08 Oct 2020 00:56:13 -0700 (PDT) Received: from localhost.localdomain ([203.205.141.61]) by smtp.gmail.com with ESMTPSA id k206sm6777106pfd.126.2020.10.08.00.56.10 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 08 Oct 2020 00:56:13 -0700 (PDT) From: yulei.kernel@gmail.com X-Google-Original-From: yuleixzhang@tencent.com To: akpm@linux-foundation.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com Cc: linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang , Xiao Guangrong Subject: [PATCH 34/35] dmem: add dmem unit tests Date: Thu, 8 Oct 2020 15:54:24 +0800 Message-Id: <0c0e00b2d89079714eb33fc3260a7d23518cb8ea.1602093760.git.yuleixzhang@tencent.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: References: In-Reply-To: References: Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: Yulei Zhang This test case is used to test dmem management system. Signed-off-by: Xiao Guangrong Signed-off-by: Yulei Zhang --- tools/testing/dmem/Kbuild | 1 + tools/testing/dmem/Makefile | 10 ++ tools/testing/dmem/dmem-test.c | 184 +++++++++++++++++++++++++++++++++ 3 files changed, 195 insertions(+) create mode 100644 tools/testing/dmem/Kbuild create mode 100644 tools/testing/dmem/Makefile create mode 100644 tools/testing/dmem/dmem-test.c diff --git a/tools/testing/dmem/Kbuild b/tools/testing/dmem/Kbuild new file mode 100644 index 000000000000..04988f7c76b7 --- /dev/null +++ b/tools/testing/dmem/Kbuild @@ -0,0 +1 @@ +obj-m += dmem-test.o diff --git a/tools/testing/dmem/Makefile b/tools/testing/dmem/Makefile new file mode 100644 index 000000000000..21f141f585de --- /dev/null +++ b/tools/testing/dmem/Makefile @@ -0,0 +1,10 @@ +KDIR ?= ../../../ + +default: + $(MAKE) -C $(KDIR) M=$$PWD + +install: default + $(MAKE) -C $(KDIR) M=$$PWD modules_install + +clean: + rm -f *.o *.ko Module.* modules.* *.mod.c diff --git a/tools/testing/dmem/dmem-test.c b/tools/testing/dmem/dmem-test.c new file mode 100644 index 000000000000..4baae18b593e --- /dev/null +++ b/tools/testing/dmem/dmem-test.c @@ -0,0 +1,184 @@ +/* + * This program is free software; you can redistribute it and/or modify + * it under the terms of version 2 of the GNU General Public License as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. 
+ */ +#include +#include +#include +#include +#include +#include +#include +#include + +struct dmem_mem_node { + struct list_head node; +}; + +static LIST_HEAD(dmem_list); + +static int dmem_test_alloc_init(unsigned long dpage_shift) +{ + int ret; + + ret = dmem_alloc_init(dpage_shift); + if (ret) + pr_info("dmem_alloc_init failed, dpage_shift %ld ret=%d\n", + dpage_shift, ret); + return ret; +} + +static int __dmem_test_alloc(int order, int nid, nodemask_t *nodemask, + const char *caller) +{ + struct dmem_mem_node *pos; + phys_addr_t addr; + int i, ret = 0; + + for (i = 0; i < (1 << order); i++) { + addr = dmem_alloc_pages_nodemask(nid, nodemask, 1, NULL); + if (!addr) { + ret = -ENOMEM; + break; + } + + pos = __va(addr); + list_add(&pos->node, &dmem_list); + } + + pr_info("%s: alloc order %d on node %d has fallback node %s... %s.\n", + caller, order, nid, nodemask ? "yes" : "no", + !ret ? "okay" : "failed"); + + return ret; +} + +static void dmem_test_free_all(void) +{ + struct dmem_mem_node *pos, *n; + + list_for_each_entry_safe(pos, n, &dmem_list, node) { + list_del(&pos->node); + dmem_free_page(__pa(pos)); + } +} + +#define dmem_test_alloc(order, nid, nodemask) \ + __dmem_test_alloc(order, nid, nodemask, __func__) + +/* dmem shoud have 2^6 native pages available at lest */ +static int order_test(void) +{ + int order, i, ret; + int page_orders[] = {0, 1, 2, 3, 4, 5, 6}; + + ret = dmem_test_alloc_init(PAGE_SHIFT); + if (ret) + return ret; + + for (i = 0; i < ARRAY_SIZE(page_orders); i++) { + order = page_orders[i]; + + ret = dmem_test_alloc(order, numa_node_id(), NULL); + if (ret) + break; + } + + dmem_test_free_all(); + + dmem_alloc_uinit(); + + return ret; +} + +static int node_test(void) +{ + nodemask_t nodemask; + unsigned long nr = 0; + int order; + int node; + int ret = 0; + + order = 0; + + ret = dmem_test_alloc_init(PUD_SHIFT); + if (ret) + return ret; + + pr_info("%s: test allocation on node 0\n", __func__); + node = 0; + nodes_clear(nodemask); + node_set(0, nodemask); + + ret = dmem_test_alloc(order, node, &nodemask); + if (ret) + goto exit; + + dmem_test_free_all(); + + pr_info("%s: begin to exhaust dmem on node 0.\n", __func__); + node = 1; + nodes_clear(nodemask); + node_set(0, nodemask); + + INIT_LIST_HEAD(&dmem_list); + while (!(ret = dmem_test_alloc(order, node, &nodemask))) + nr++; + + pr_info("Allocation on node 0 success times: %lu\n", nr); + + pr_info("%s: allocation on node 0 again\n", __func__); + node = 0; + nodes_clear(nodemask); + node_set(0, nodemask); + ret = dmem_test_alloc(order, node, &nodemask); + if (!ret) { + pr_info("\tNot expected fallback\n"); + ret = -1; + } else { + ret = 0; + pr_info("\tOK, Dmem on node 0 exhausted, fallback success\n"); + } + + pr_info("%s: Release dmem\n", __func__); + dmem_test_free_all(); + +exit: + dmem_alloc_uinit(); + return ret; +} + +static __init int dmem_test_init(void) +{ + int ret; + + pr_info("dmem: test init...\n"); + + ret = order_test(); + if (ret) + return ret; + + ret = node_test(); + + + if (ret) + pr_info("dmem test fail, ret=%d\n", ret); + else + pr_info("dmem test success\n"); + return ret; +} + +static __exit void dmem_test_exit(void) +{ + pr_info("dmem: test exit...\n"); +} + +module_init(dmem_test_init); +module_exit(dmem_test_exit); +MODULE_LICENSE("GPL v2"); From patchwork Thu Oct 8 07:54:25 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: yulei zhang X-Patchwork-Id: 11822413 Return-Path: Received: from mail.kernel.org 
(pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id A0B961580 for ; Thu, 8 Oct 2020 07:56:41 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 7EA392083B for ; Thu, 8 Oct 2020 07:56:41 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="sD5at9h1" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729203AbgJHH4d (ORCPT ); Thu, 8 Oct 2020 03:56:33 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:52422 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729214AbgJHH41 (ORCPT ); Thu, 8 Oct 2020 03:56:27 -0400 Received: from mail-pg1-x543.google.com (mail-pg1-x543.google.com [IPv6:2607:f8b0:4864:20::543]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 9A253C0613DC; Thu, 8 Oct 2020 00:56:18 -0700 (PDT) Received: by mail-pg1-x543.google.com with SMTP id 34so3563537pgo.13; Thu, 08 Oct 2020 00:56:18 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=from:to:cc:subject:date:message-id:in-reply-to:references :in-reply-to:references; bh=2BnfFJzyK/ObYmFl6qjjvD95MAX5noA3ThIztEgoaE0=; b=sD5at9h1ul8tK2eGhC10aT7eu1ACuvM8pX0UClSoOBqsWnd8ZLKBG21Or5ujlsSO4r ZcSP94jhTlJ88RZsOP3erYbPjNfZZ+XJV13wDvViB10BsEc7+wwzeX5ngETPK0AzznVh WhaVc55dAnhOG0gv1GJzH2TCaJ1rhgbsk4LmqTeQVRFjQ3hIczSrI6SQqNvwPnABDU6s FP8l4ZgVXEIoX5hqiBtlmgXrH9XlMNG3EtoY0tPObqBPsqr62osd5A86auhcTjL7TLrZ +dJRfZ8CQc6eSephIBvs2EBuVvTMTEGtKRR5UbvkYVR5BkJe9ido8lbljwqBnCt3qfBE 8Wew== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:to:cc:subject:date:message-id:in-reply-to :references:in-reply-to:references; bh=2BnfFJzyK/ObYmFl6qjjvD95MAX5noA3ThIztEgoaE0=; b=uLOzqg+GDrzZtbnXkRSpesLhB9psHVG7yXFyTU6zzwyUsGBI+9pTUMzwBlTFOBr2Gg wgACGHqDbr7zwJZvw/v4VrKFoZI6CjIUqcsCK0cyvnponme+5qxRo1nxTQRbB4ZNJlmL jsbMOWhcBC06Q7aFRjqoAX1mCzhapZDvivKZy6VZLTfNwUzbfhJJQdkVi77FttbgbY7H 1B/plINpLg+OVrUTsy6Mk0XbxWKv+c1CAdEHiwz1KUvSZHe+SmGkY6vpcBuhPgyb6xh5 Qq6nwQYkVGOC0YmDr/3bRNnJdIXfY2xST0wjtJwTAwF27ZCzYwbhtelX/G/vJMTYF9Yd 8caQ== X-Gm-Message-State: AOAM533P9+stwrENgcd1fzgHYHCs3znk+Z0T2CNoTquZ8UG7XYGYoDFB w9F9QxA1paD3LQP5oi/ewW0= X-Google-Smtp-Source: ABdhPJydbIfkN0fEyXGB0jM4KFOYE2spk9nEprX0Wp7N/fpC/X8oW8XY/9iFxzAbdDrLYJ2uam3ogQ== X-Received: by 2002:a17:90a:a09:: with SMTP id o9mr6438714pjo.134.1602143778213; Thu, 08 Oct 2020 00:56:18 -0700 (PDT) Received: from localhost.localdomain ([203.205.141.61]) by smtp.gmail.com with ESMTPSA id k206sm6777106pfd.126.2020.10.08.00.56.15 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 08 Oct 2020 00:56:17 -0700 (PDT) From: yulei.kernel@gmail.com X-Google-Original-From: yuleixzhang@tencent.com To: akpm@linux-foundation.org, naoya.horiguchi@nec.com, viro@zeniv.linux.org.uk, pbonzini@redhat.com Cc: linux-fsdevel@vger.kernel.org, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, xiaoguangrong.eric@gmail.com, kernellwp@gmail.com, lihaiwei.kernel@gmail.com, Yulei Zhang Subject: [PATCH 35/35] Add documentation for dmemfs Date: Thu, 8 Oct 2020 15:54:25 +0800 Message-Id: <4d1bc80e93134fb0f5691db5c4bb8bcbc1e716dd.1602093760.git.yuleixzhang@tencent.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: References: In-Reply-To: References: Precedence: bulk List-ID: X-Mailing-List: kvm@vger.kernel.org From: Yulei Zhang Introduce dmemfs.rst to 
document the basic usage of dmemfs. Signed-off-by: Yulei Zhang --- Documentation/filesystems/dmemfs.rst | 59 ++++++++++++++++++++++++++++ 1 file changed, 59 insertions(+) create mode 100644 Documentation/filesystems/dmemfs.rst diff --git a/Documentation/filesystems/dmemfs.rst b/Documentation/filesystems/dmemfs.rst new file mode 100644 index 000000000000..cbb4cc1ed31d --- /dev/null +++ b/Documentation/filesystems/dmemfs.rst @@ -0,0 +1,57 @@ +.. SPDX-License-Identifier: GPL-2.0 + +===================================== +The Direct Memory Filesystem - DMEMFS +===================================== + + +.. Table of contents + + - Overview + - Compilation + - Usage + +Overview +======== + +Dmemfs (Direct Memory filesystem) is a filesystem backed by device +memory or reserved memory. This kind of memory is special as it +is not managed by the kernel and has no 'struct page'. Therefore +it can save extra memory on the host system for various usages, +especially for guest virtual machines. + +It uses the kernel boot parameter ``dmem=`` to reserve system +memory when the host system boots up; the details can be checked +in /Documentation/admin-guide/kernel-parameters.txt. + +Compilation +=========== + +The filesystem should be enabled by turning on the kernel configuration +options:: + + CONFIG_DMEM_FS - Direct Memory filesystem support + CONFIG_DMEM - Allow reservation of memory for dmem + + +Additionally, the following can be turned on to aid debugging:: + + CONFIG_DMEM_DEBUG_FS - Enable debug information for dmem + +Usage +======== + +Dmemfs supports mapping ``4K``, ``2M`` and ``1G`` sized pages to +userspace, for example :: + + # mount -t dmemfs none -o pagesize=4K /mnt/ + +Then backing storage of 4G size can be created :: + + # truncate /mnt/dmemfs-uuid --size 4G + +To use it as backing storage for a virtual machine started with qemu, +just specify the memory-backend-file in the qemu command line like this :: + + # -object memory-backend-file,id=ram-node0,mem-path=/mnt/dmemfs-uuid \ + share=yes,size=4G,host-nodes=0,policy=preferred -numa node,nodeid=0,memdev=ram-node0
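To make the qemu example above a bit more concrete, the following is a minimal, hedged C sketch of what a dmemfs consumer (qemu's memory-backend-file, or any other user) effectively does with such a file: create it, size it with ftruncate(), and mmap() it shared. The mount point and file name are taken from the example above; the size and the error handling are arbitrary choices for the sketch.

/* Minimal sketch: map a file on a mounted dmemfs instance and touch it. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        const char *path = "/mnt/dmemfs-uuid";  /* assumed dmemfs mount point */
        size_t size = 1UL << 30;                /* 1G for the sketch */
        int fd = open(path, O_CREAT | O_RDWR, 0600);

        if (fd < 0) {
                perror("open");
                return 1;
        }
        if (ftruncate(fd, size) < 0) {          /* reserve the backing dpages */
                perror("ftruncate");
                return 1;
        }
        void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (mem == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        memset(mem, 0, 4096);                   /* fault in the first page */
        munmap(mem, size);
        close(fd);
        return 0;
}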