[v1,1/4] mm: handle poisoning of pfn without struct pages

Message ID	20230920140210.12663-2-ankita@nvidia.com (mailing list archive)
State	New
Headers	show Return-Path: <owner-linux-mm@kvack.org> Received-SPF: Pass (protection.outlook.com: domain of nvidia.com designates 216.228.117.161 as permitted sender) receiver=protection.outlook.com; client-ip=216.228.117.161; helo=mail.nvidia.com; pr=C From: <ankita@nvidia.com> To: <ankita@nvidia.com>, <jgg@nvidia.com>, <alex.williamson@redhat.com>, <akpm@linux-foundation.org>, <tony.luck@intel.com>, <bp@alien8.de>, <naoya.horiguchi@nec.com>, <linmiaohe@huawei.com> CC: <aniketa@nvidia.com>, <cjia@nvidia.com>, <kwankhede@nvidia.com>, <targupta@nvidia.com>, <vsethi@nvidia.com>, <acurrid@nvidia.com>, <anuaggarwal@nvidia.com>, <linux-kernel@vger.kernel.org>, <linux-mm@kvack.org>, <linux-edac@vger.kernel.org>, <kvm@vger.kernel.org> Subject: [PATCH v1 1/4] mm: handle poisoning of pfn without struct pages Date: Wed, 20 Sep 2023 19:32:07 +0530 Message-ID: <20230920140210.12663-2-ankita@nvidia.com> In-Reply-To: <20230920140210.12663-1-ankita@nvidia.com> References: <20230920140210.12663-1-ankita@nvidia.com> MIME-Version: 1.0 Content-Type: text/plain Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	mm: Implement ECC handling for pfn with no struct page \| expand [v1,0/4] mm: Implement ECC handling for pfn with no struct page [v1,1/4] mm: handle poisoning of pfn without struct pages [v1,2/4] mm: Add poison error check in fixup_user_fault() for mapped pfn [v1,3/4] mm: Change ghes code to allow poison of non-struct pfn [v1,4/4] vfio/nvgpu: register device memory for poison handling

Message ID

20230920140210.12663-2-ankita@nvidia.com (mailing list archive)

State

New

Headers

Received-SPF: Pass (protection.outlook.com: domain of nvidia.com designates
 216.228.117.161 as permitted sender) receiver=protection.outlook.com;
 client-ip=216.228.117.161; helo=mail.nvidia.com; pr=C
From: <ankita@nvidia.com>
To: <ankita@nvidia.com>, <jgg@nvidia.com>, <alex.williamson@redhat.com>,
	<akpm@linux-foundation.org>, <tony.luck@intel.com>, <bp@alien8.de>,
	<naoya.horiguchi@nec.com>, <linmiaohe@huawei.com>
CC: <aniketa@nvidia.com>, <cjia@nvidia.com>, <kwankhede@nvidia.com>,
	<targupta@nvidia.com>, <vsethi@nvidia.com>, <acurrid@nvidia.com>,
	<anuaggarwal@nvidia.com>, <linux-kernel@vger.kernel.org>,
	<linux-mm@kvack.org>, <linux-edac@vger.kernel.org>, <kvm@vger.kernel.org>
Subject: [PATCH v1 1/4] mm: handle poisoning of pfn without struct pages
Date: Wed, 20 Sep 2023 19:32:07 +0530
Message-ID: <20230920140210.12663-2-ankita@nvidia.com>
In-Reply-To: <20230920140210.12663-1-ankita@nvidia.com>
References: <20230920140210.12663-1-ankita@nvidia.com>
MIME-Version: 1.0
Content-Type: text/plain
X-MS-Exchange-CrossTenant-OriginalArrivalTime: 20 Sep 2023 14:02:37.5536
 (UTC)
X-MS-Exchange-CrossTenant-Network-Message-Id: 
 21ab9138-4687-41fc-ce88-08dbb9e23f32
X-MS-Exchange-CrossTenant-Id: 43083d15-7273-40c1-b7db-39efd9ccc17a
X-MS-Exchange-CrossTenant-OriginalAttributedTenantConnectingIp: 
 TenantId=43083d15-7273-40c1-b7db-39efd9ccc17a;Ip=[216.228.117.161];Helo=[mail.nvidia.com]
X-MS-Exchange-CrossTenant-AuthSource: 
	DS2PEPF00003444.namprd04.prod.outlook.com
X-MS-Exchange-CrossTenant-AuthAs: Anonymous
X-MS-Exchange-CrossTenant-FromEntityHeader: HybridOnPrem
X-MS-Exchange-Transport-CrossTenantHeadersStamped: SN7PR12MB6931
X-Rspamd-Queue-Id: 76FDF4032B
X-Rspam-User: 
X-Stat-Signature: ppgg3mtkjtrh3i43c9ou6bhwrqnpkpao
X-Rspamd-Server: rspam03
X-HE-Tag: 1695218569-822908
X-HE-Meta: 
 U2FsdGVkX18KEFUyTEFXuak0ygEvKPU5MC96BU8M+3uws7ZdafXvZTMsfauMNZV7nQgaJJp9GvpkXNH1lLUXcKDAwgCG3Oz70ZGrwOhG48//gedbu8u2u9NY99J8/Cu6E9FCbqtkcbauWeI856qOQwOnu2tNeu/pLrnW+fEm7FqnbR/eu/HPlPPFvx/t35FAqX8qJvAeuFfsTOk+qTn/+jvFCPMG1M+se9+57aYXkmiv7B2mVOPq4ldXbZwt3ZK2PgIw7pPUohAw6rC9qoLUtE8L+ZvLbj7CyAhEbSwRh22mUhdHqGlloAwq9MzX48pen9YJxhzIsmxEFDuaKCb54bRBSi6bx4Xc8mYS1PN7ZLX8wdSQ8Qr8SGZjf8M4CzzpOrjc1M5PJgq0uaJHbWtN1sQ1dG1tCJxWTcR1cAN8SOdj0SiVR9mL7gWaHrTFrbcOL9IoZLe53XlTFDWHzwz9lGsfoNQRgjXXX9voXg4dfp3gPArLnQurFBZ1qTB+ragO1WR1NM2DReW04dTLXfWmMJkhVWw/knQ75WcMJQuhMqzHfz2y+9Nsv2qkjgKZp1IGLcg1ZDqBqkmcKTDRghiyJKKpBYhRnEwdU48IG0ViVtpgpdMjksOTVPZvzsKrTwtb41sDROArmd5hwS1m15N2O/b8NGN0CpaOAC7Yel2QUHvXAskenWq5Ce7jRP1c5UpXNpDLffcZs+Oh9qEjEkZkE4NCD/pXdnKVTNVc38ZcZplKxX1SLyNSn9zYZudFnIBxNi9ijrs+//4stms73WuP7D34fTu7TKM49wtrkE92K9+XPBn8JGIA1boFFo65FLbt2p2UIeqWZ9ZTIxlJWCxeAPRZ6nCdnyeT3gCS5E3swa/f6jx2RG49PT2kCkKUFOHkfA6wKU/QCXa66a7Eld1LwxhzEDnScXSwCpFfjsZPn1O1kDJddAsmSxOY/WT48HWNBFAFTLnb/qj2h6ThfGu
 ynC8CZh+
 ztBJI6WrNqLyfvB6/Zhk41DUvJwfq/bDV7Lxzu6zP00amfz+b5klrThIN9xaRxYZhQf/V1d4/ud9VYQeGzx+buS0/Ye4iLzpPqfpyNOe3TLuENSuxjtMHiUx4/9nPJ7ctTbbqdMlGAt8xmvN8WjnPOBkqAi2ChffEHm3Z0HU8353zjipM6XiWcFirdjX7KVlhbxGWtiHhpvyl2BrBKSFrjYqF12/l0dRJuD6f
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

Series

mm: Implement ECC handling for pfn with no struct page | expand

Commit Message

Ankit Agrawal Sept. 20, 2023, 2:02 p.m. UTC

From: Ankit Agrawal <ankita@nvidia.com>

The kernel MM currently does not handle ECC errors / poison on a memory
region that is not backed by struct pages. If a memory region is mapped
using remap_pfn_range(), but not added to the kernel, MM will not have
associated struct pages. Add a new mechanism to handle memory failure
on such memory.

Make kernel MM expose a function to allow modules managing the device
memory to register a failure function and the physical address space
associated with the device memory. MM maintains this information as
interval tree. The registered memory failure function is used by MM to
notify the kernel module managing the PFN, so that the module may take
any required action. The module for example may use the information
to track the poisoned pages.

In this implementation, kernel MM follows the following sequence similar
(mostly) to the memory_failure() handler for struct page backed memory:
1. memory_failure() is triggered on reception of a poison error. An
absence of struct page is detected and consequently memory_failure_pfn()
is executed.
2. memory_failure_pfn() call the newly introduced failure handler exposed
by the module managing the poisoned memory to notify it of the problematic
PFN.
3. memory_failure_pfn() unmaps the stage-2 mapping to the PFN.
4. memory_failure_pfn() collects the processes mapped to the PFN.
5. memory_failure_pfn() sends SIGBUS (BUS_MCEERR_AO) to all the processes
mapping the faulty PFN using kill_procs().
6. An access to the faulty PFN by an operation in VM at a later point
is trapped and user_mem_abort() is called.
7. The vma ops fault function gets called due to the absence of Stage-2
mapping. It is expected to return VM_FAULT_HWPOISON on the PFN.
8. __gfn_to_pfn_memslot() then returns KVM_PFN_ERR_HWPOISON, which cause
the poison with SIGBUS (BUS_MCEERR_AR) to be sent to the QEMU process
through kvm_send_hwpoison_signal().

Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
---
 include/linux/memory-failure.h |  22 ++++++
 include/linux/mm.h             |   1 +
 include/ras/ras_event.h        |   1 +
 mm/Kconfig                     |   1 +
 mm/memory-failure.c            | 135 ++++++++++++++++++++++++++++-----
 5 files changed, 139 insertions(+), 21 deletions(-)
 create mode 100644 include/linux/memory-failure.h

Comments

Miaohe Lin Sept. 23, 2023, 3:20 a.m. UTC | #1

On 2023/9/20 22:02, ankita@nvidia.com wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
> 
> The kernel MM currently does not handle ECC errors / poison on a memory
> region that is not backed by struct pages. If a memory region is mapped
> using remap_pfn_range(), but not added to the kernel, MM will not have
> associated struct pages. Add a new mechanism to handle memory failure
> on such memory.
> 
> Make kernel MM expose a function to allow modules managing the device
> memory to register a failure function and the physical address space
> associated with the device memory. MM maintains this information as
> interval tree. The registered memory failure function is used by MM to
> notify the kernel module managing the PFN, so that the module may take
> any required action. The module for example may use the information
> to track the poisoned pages.
> 
> In this implementation, kernel MM follows the following sequence similar
> (mostly) to the memory_failure() handler for struct page backed memory:
> 1. memory_failure() is triggered on reception of a poison error. An
> absence of struct page is detected and consequently memory_failure_pfn()
> is executed.
> 2. memory_failure_pfn() call the newly introduced failure handler exposed
> by the module managing the poisoned memory to notify it of the problematic
> PFN.
> 3. memory_failure_pfn() unmaps the stage-2 mapping to the PFN.
> 4. memory_failure_pfn() collects the processes mapped to the PFN.
> 5. memory_failure_pfn() sends SIGBUS (BUS_MCEERR_AO) to all the processes
> mapping the faulty PFN using kill_procs().
> 6. An access to the faulty PFN by an operation in VM at a later point
> is trapped and user_mem_abort() is called.
> 7. The vma ops fault function gets called due to the absence of Stage-2
> mapping. It is expected to return VM_FAULT_HWPOISON on the PFN.
> 8. __gfn_to_pfn_memslot() then returns KVM_PFN_ERR_HWPOISON, which cause
> the poison with SIGBUS (BUS_MCEERR_AR) to be sent to the QEMU process
> through kvm_send_hwpoison_signal().
> 
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>

Thanks for your patch.

<snip>

>  /*
>   * Return values:
>   *   1:   the page is dissolved (if needed) and taken off from buddy,
> @@ -422,15 +428,15 @@ static unsigned long dev_pagemap_mapping_shift(struct vm_area_struct *vma,
>   * Schedule a process for later kill.
>   * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
>   *
> - * Note: @fsdax_pgoff is used only when @p is a fsdax page and a
> - * filesystem with a memory failure handler has claimed the
> - * memory_failure event. In all other cases, page->index and
> - * page->mapping are sufficient for mapping the page back to its
> + * Notice: @pgoff is used either when @p is a fsdax page or a PFN is not
> + * backed by struct page and a filesystem with a memory failure handler
> + * has claimed the memory_failure event. In all other cases, page->index
> + * and page->mapping are sufficient for mapping the page back to its
>   * corresponding user virtual address.
>   */
>  static void __add_to_kill(struct task_struct *tsk, struct page *p,
>  			  struct vm_area_struct *vma, struct list_head *to_kill,
> -			  unsigned long ksm_addr, pgoff_t fsdax_pgoff)
> +			  unsigned long ksm_addr, pgoff_t pgoff)
>  {
>  	struct to_kill *tk;
>  
> @@ -440,13 +446,18 @@ static void __add_to_kill(struct task_struct *tsk, struct page *p,
>  		return;
>  	}
>  
> -	tk->addr = ksm_addr ? ksm_addr : page_address_in_vma(p, vma);
> -	if (is_zone_device_page(p)) {
> -		if (fsdax_pgoff != FSDAX_INVALID_PGOFF)
> -			tk->addr = vma_pgoff_address(fsdax_pgoff, 1, vma);
> -		tk->size_shift = dev_pagemap_mapping_shift(vma, tk->addr);
> -	} else
> -		tk->size_shift = page_shift(compound_head(p));
> +	if (vma->vm_flags | PFN_MAP) {

if (vma->vm_flags | PFN_MAP)? So this branch is always selected?

> +		tk->addr = vma_pgoff_address(pgoff, 1, vma);
> +		tk->size_shift = PAGE_SHIFT;
> +	} else {
> +		tk->addr = ksm_addr ? ksm_addr : page_address_in_vma(p, vma);
> +		if (is_zone_device_page(p)) {
> +			if (pgoff != FSDAX_INVALID_PGOFF)
> +				tk->addr = vma_pgoff_address(pgoff, 1, vma);
> +			tk->size_shift = dev_pagemap_mapping_shift(vma, tk->addr);
> +		} else
> +			tk->size_shift = page_shift(compound_head(p));
> +	}
>  

IIUC, the page passed to __add_to_kill is NULL in this case. So when tk->addr == -EFAULT, we will have problem
to do the page_to_pfn(p) in the following pr_info:

	if (tk->addr == -EFAULT) {
		pr_info("Unable to find user space address %lx in %s\n",
			page_to_pfn(p), tsk->comm);

>  	/*
>  	 * Send SIGKILL if "tk->addr == -EFAULT". Also, as
> @@ -666,8 +677,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
>  	i_mmap_unlock_read(mapping);
>  }
>  

<snip>

>  /**
>   * memory_failure - Handle memory failure of a page.
>   * @pfn: Page Number of the corrupted page
> @@ -2183,6 +2271,11 @@ int memory_failure(unsigned long pfn, int flags)
>  	if (!(flags & MF_SW_SIMULATED))
>  		hw_memory_failure = true;
>  
> +	if (!pfn_valid(pfn) && !arch_is_platform_page(PFN_PHYS(pfn))) {

Could it be better to add a helper here to detect the pfns without struct page?

> +		res = memory_failure_pfn(pfn, flags);
> +		goto unlock_mutex;
> +	}
> +
>  	p = pfn_to_online_page(pfn);
>  	if (!p) {
>  		res = arch_memory_failure(pfn, flags);
> 

Thanks.

Jason Gunthorpe Sept. 25, 2023, 12:36 p.m. UTC | #2

On Sat, Sep 23, 2023 at 11:20:19AM +0800, Miaohe Lin wrote:

> >  /**
> >   * memory_failure - Handle memory failure of a page.
> >   * @pfn: Page Number of the corrupted page
> > @@ -2183,6 +2271,11 @@ int memory_failure(unsigned long pfn, int flags)
> >  	if (!(flags & MF_SW_SIMULATED))
> >  		hw_memory_failure = true;
> >  
> > +	if (!pfn_valid(pfn) && !arch_is_platform_page(PFN_PHYS(pfn))) {
> 
> Could it be better to add a helper here to detect the pfns without
> struct page?

pfn_valid is supposed to do that.

This arch_is_platform_page stuff is actually detecting Intel SGX
memory and routing it to arch_memory_failure()

It would have been more accurately named
arch_is_arch_memory_failure_pfn() or something

Actually that SGX stuff could probably be changed over to use the
interval tree of this series. Modify sgx_setup_epc_section() to
register tree nodes per-section and remove all this arch stuff
entirely.

Jason

Naoya Horiguchi Sept. 26, 2023, 7:23 a.m. UTC | #3

On Wed, Sep 20, 2023 at 07:32:07PM +0530, ankita@nvidia.com wrote:
> From: Ankit Agrawal <ankita@nvidia.com>
> 
> The kernel MM currently does not handle ECC errors / poison on a memory
> region that is not backed by struct pages. If a memory region is mapped
> using remap_pfn_range(), but not added to the kernel, MM will not have
> associated struct pages. Add a new mechanism to handle memory failure
> on such memory.
> 
> Make kernel MM expose a function to allow modules managing the device
> memory to register a failure function and the physical address space
> associated with the device memory. MM maintains this information as
> interval tree. The registered memory failure function is used by MM to
> notify the kernel module managing the PFN, so that the module may take
> any required action. The module for example may use the information
> to track the poisoned pages.
> 
> In this implementation, kernel MM follows the following sequence similar
> (mostly) to the memory_failure() handler for struct page backed memory:
> 1. memory_failure() is triggered on reception of a poison error. An
> absence of struct page is detected and consequently memory_failure_pfn()
> is executed.
> 2. memory_failure_pfn() call the newly introduced failure handler exposed
> by the module managing the poisoned memory to notify it of the problematic
> PFN.
> 3. memory_failure_pfn() unmaps the stage-2 mapping to the PFN.
> 4. memory_failure_pfn() collects the processes mapped to the PFN.
> 5. memory_failure_pfn() sends SIGBUS (BUS_MCEERR_AO) to all the processes
> mapping the faulty PFN using kill_procs().
> 6. An access to the faulty PFN by an operation in VM at a later point
> is trapped and user_mem_abort() is called.
> 7. The vma ops fault function gets called due to the absence of Stage-2
> mapping. It is expected to return VM_FAULT_HWPOISON on the PFN.
> 8. __gfn_to_pfn_memslot() then returns KVM_PFN_ERR_HWPOISON, which cause
> the poison with SIGBUS (BUS_MCEERR_AR) to be sent to the QEMU process
> through kvm_send_hwpoison_signal().
> 
> Signed-off-by: Ankit Agrawal <ankita@nvidia.com>

Thanks for the patches.

A few comment below ...

...

> @@ -422,15 +428,15 @@ static unsigned long dev_pagemap_mapping_shift(struct vm_area_struct *vma,
>   * Schedule a process for later kill.
>   * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
>   *
> - * Note: @fsdax_pgoff is used only when @p is a fsdax page and a
> - * filesystem with a memory failure handler has claimed the
> - * memory_failure event. In all other cases, page->index and
> - * page->mapping are sufficient for mapping the page back to its
> + * Notice: @pgoff is used either when @p is a fsdax page or a PFN is not
> + * backed by struct page and a filesystem with a memory failure handler
> + * has claimed the memory_failure event.

This sentense is unclear because latter part ("a filesystem with ...")
is not true for pfns not backed by struct page.  Could you separate this
notice into two (one for fsdax case and one for "non struct page" case)?

> In all other cases, page->index
> + * and page->mapping are sufficient for mapping the page back to its
>   * corresponding user virtual address.
>   */
>  static void __add_to_kill(struct task_struct *tsk, struct page *p,
>  			  struct vm_area_struct *vma, struct list_head *to_kill,
> -			  unsigned long ksm_addr, pgoff_t fsdax_pgoff)
> +			  unsigned long ksm_addr, pgoff_t pgoff)
>  {
>  	struct to_kill *tk;
>  

...

> @@ -677,9 +687,9 @@ static void add_to_kill_fsdax(struct task_struct *tsk, struct page *p,
>  /*
>   * Collect processes when the error hit a fsdax page.

Maybe you need update the comment not to restrict to fsdax page?

>   */
> -static void collect_procs_fsdax(struct page *page,
> -		struct address_space *mapping, pgoff_t pgoff,
> -		struct list_head *to_kill)
> +static void collect_procs_pgoff(struct page *page,
> +				struct address_space *mapping, pgoff_t pgoff,
> +				struct list_head *to_kill)
>  {
>  	struct vm_area_struct *vma;
>  	struct task_struct *tsk;

...

> @@ -2144,6 +2155,83 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
>  	return rc;
>  }
>  
> +int register_pfn_address_space(struct pfn_address_space *pfn_space)
> +{
> +	if (!pfn_space)
> +		return -EINVAL;
> +
> +	if (!request_mem_region(pfn_space->node.start << PAGE_SHIFT,
> +		(pfn_space->node.last - pfn_space->node.start + 1) << PAGE_SHIFT, ""))
> +		return -EBUSY;
> +
> +	mutex_lock(&pfn_space_lock);
> +	interval_tree_insert(&pfn_space->node, &pfn_space_itree);
> +	mutex_unlock(&pfn_space_lock);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(register_pfn_address_space);
> +
> +void unregister_pfn_address_space(struct pfn_address_space *pfn_space)
> +{
> +	if (!pfn_space)
> +		return;
> +
> +	mutex_lock(&pfn_space_lock);
> +	interval_tree_remove(&pfn_space->node, &pfn_space_itree);
> +	mutex_unlock(&pfn_space_lock);
> +	release_mem_region(pfn_space->node.start << PAGE_SHIFT,
> +		(pfn_space->node.last - pfn_space->node.start + 1) << PAGE_SHIFT);
> +}
> +EXPORT_SYMBOL_GPL(unregister_pfn_address_space);
> +
> +static int memory_failure_pfn(unsigned long pfn, int flags)
> +{
> +	struct interval_tree_node *node;
> +	int res = MF_FAILED;
> +	LIST_HEAD(tokill);
> +
> +	mutex_lock(&pfn_space_lock);
> +	/*
> +	 * Modules registers with MM the address space mapping to the device memory they
> +	 * manage. Iterate to identify exactly which address space has mapped to this
> +	 * failing PFN.
> +	 */
> +	for (node = interval_tree_iter_first(&pfn_space_itree, pfn, pfn); node;
> +	     node = interval_tree_iter_next(node, pfn, pfn)) {
> +		struct pfn_address_space *pfn_space =
> +			container_of(node, struct pfn_address_space, node);
> +		/*
> +		 * Modules managing the device memory need to be conveyed about the
> +		 * memory failure so that the poisoned PFN can be tracked.
> +		 */
> +		if (pfn_space->ops)
> +			pfn_space->ops->failure(pfn_space, pfn);
> +
> +		collect_procs_pgoff(NULL, pfn_space->mapping, pfn, &tokill);
> +
> +		unmap_mapping_range(pfn_space->mapping, pfn << PAGE_SHIFT,
> +				    PAGE_SIZE, 0);
> +
> +		res = MF_RECOVERED;
> +	}
> +	mutex_unlock(&pfn_space_lock);
> +
> +	if (res == MF_FAILED)
> +		return action_result(pfn, MF_MSG_PFN_MAP, res);
> +
> +	/*
> +	 * Unlike System-RAM there is no possibility to swap in a different
> +	 * physical page at a given virtual address, so all userspace
> +	 * consumption of direct PFN memory necessitates SIGBUS (i.e.
> +	 * MF_MUST_KILL)
> +	 */
> +	flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
> +	kill_procs(&tokill, true, false, pfn, flags);
> +
> +	return action_result(pfn, MF_MSG_PFN_MAP, MF_RECOVERED);
> +}
> +

It might not be a major issue, but these new code above seems to be used
only when CONFIG_NVGRACE_GPU_VFIO_PCI is enabled, so putting this in
#ifdef block might be helpful to save binary size without nvgrace-gpu-vfio-pci.

Thanks,
Naoya Horiguchi

>  /**
>   * memory_failure - Handle memory failure of a page.
>   * @pfn: Page Number of the corrupted page
> @@ -2183,6 +2271,11 @@ int memory_failure(unsigned long pfn, int flags)
>  	if (!(flags & MF_SW_SIMULATED))
>  		hw_memory_failure = true;
>  
> +	if (!pfn_valid(pfn) && !arch_is_platform_page(PFN_PHYS(pfn))) {
> +		res = memory_failure_pfn(pfn, flags);
> +		goto unlock_mutex;
> +	}
> +
>  	p = pfn_to_online_page(pfn);
>  	if (!p) {
>  		res = arch_memory_failure(pfn, flags);
> -- 
> 2.17.1
> 
> 
>

diff --git a/include/linux/memory-failure.h b/include/linux/memory-failure.h
new file mode 100644
index 000000000000..9a579960972a
--- /dev/null
+++ b/include/linux/memory-failure.h
@@ -0,0 +1,22 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_MEMORY_FAILURE_H
+#define _LINUX_MEMORY_FAILURE_H
+
+#include <linux/interval_tree.h>
+
+struct pfn_address_space;
+
+struct pfn_address_space_ops {
+	void (*failure)(struct pfn_address_space *pfn_space, unsigned long pfn);
+};
+
+struct pfn_address_space {
+	struct interval_tree_node node;
+	const struct pfn_address_space_ops *ops;
+	struct address_space *mapping;
+};
+
+int register_pfn_address_space(struct pfn_address_space *pfn_space);
+void unregister_pfn_address_space(struct pfn_address_space *pfn_space);
+
+#endif /* _LINUX_MEMORY_FAILURE_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index bf5d0b1b16f4..d677688c016c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3934,6 +3934,7 @@  enum mf_action_page_type {
 	MF_MSG_BUDDY,
 	MF_MSG_DAX,
 	MF_MSG_UNSPLIT_THP,
+	MF_MSG_PFN_MAP,
 	MF_MSG_UNKNOWN,
 };
 
diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
index cbd3ddd7c33d..05c3e6f6bd02 100644
--- a/include/ras/ras_event.h
+++ b/include/ras/ras_event.h
@@ -373,6 +373,7 @@  TRACE_EVENT(aer_event,
 	EM ( MF_MSG_BUDDY, "free buddy page" )				\
 	EM ( MF_MSG_DAX, "dax page" )					\
 	EM ( MF_MSG_UNSPLIT_THP, "unsplit thp" )			\
+	EM ( MF_MSG_PFN_MAP, "non struct page pfn" )			\
 	EMe ( MF_MSG_UNKNOWN, "unknown page" )
 
 /*
diff --git a/mm/Kconfig b/mm/Kconfig
index 264a2df5ecf5..2ee42ff8b6ca 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -762,6 +762,7 @@  config MEMORY_FAILURE
 	depends on ARCH_SUPPORTS_MEMORY_FAILURE
 	bool "Enable recovery from hardware memory errors"
 	select MEMORY_ISOLATION
+	select INTERVAL_TREE
 	select RAS
 	help
 	  Enables code to recover from some memory failures on systems
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 4d6e43c88489..e1e1d96fd6a2 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -38,6 +38,7 @@ 
 
 #include <linux/kernel.h>
 #include <linux/mm.h>
+#include <linux/memory-failure.h>
 #include <linux/page-flags.h>
 #include <linux/sched/signal.h>
 #include <linux/sched/task.h>
@@ -60,6 +61,7 @@ 
 #include <linux/pagewalk.h>
 #include <linux/shmem_fs.h>
 #include <linux/sysctl.h>
+#include <linux/pfn_t.h>
 #include "swap.h"
 #include "internal.h"
 #include "ras/ras_event.h"
@@ -144,6 +146,10 @@  static struct ctl_table memory_failure_table[] = {
 	{ }
 };
 
+static struct rb_root_cached pfn_space_itree = RB_ROOT_CACHED;
+
+static DEFINE_MUTEX(pfn_space_lock);
+
 /*
  * Return values:
  *   1:   the page is dissolved (if needed) and taken off from buddy,
@@ -422,15 +428,15 @@  static unsigned long dev_pagemap_mapping_shift(struct vm_area_struct *vma,
  * Schedule a process for later kill.
  * Uses GFP_ATOMIC allocations to avoid potential recursions in the VM.
  *
- * Note: @fsdax_pgoff is used only when @p is a fsdax page and a
- * filesystem with a memory failure handler has claimed the
- * memory_failure event. In all other cases, page->index and
- * page->mapping are sufficient for mapping the page back to its
+ * Notice: @pgoff is used either when @p is a fsdax page or a PFN is not
+ * backed by struct page and a filesystem with a memory failure handler
+ * has claimed the memory_failure event. In all other cases, page->index
+ * and page->mapping are sufficient for mapping the page back to its
  * corresponding user virtual address.
  */
 static void __add_to_kill(struct task_struct *tsk, struct page *p,
 			  struct vm_area_struct *vma, struct list_head *to_kill,
-			  unsigned long ksm_addr, pgoff_t fsdax_pgoff)
+			  unsigned long ksm_addr, pgoff_t pgoff)
 {
 	struct to_kill *tk;
 
@@ -440,13 +446,18 @@  static void __add_to_kill(struct task_struct *tsk, struct page *p,
 		return;
 	}
 
-	tk->addr = ksm_addr ? ksm_addr : page_address_in_vma(p, vma);
-	if (is_zone_device_page(p)) {
-		if (fsdax_pgoff != FSDAX_INVALID_PGOFF)
-			tk->addr = vma_pgoff_address(fsdax_pgoff, 1, vma);
-		tk->size_shift = dev_pagemap_mapping_shift(vma, tk->addr);
-	} else
-		tk->size_shift = page_shift(compound_head(p));
+	if (vma->vm_flags | PFN_MAP) {
+		tk->addr = vma_pgoff_address(pgoff, 1, vma);
+		tk->size_shift = PAGE_SHIFT;
+	} else {
+		tk->addr = ksm_addr ? ksm_addr : page_address_in_vma(p, vma);
+		if (is_zone_device_page(p)) {
+			if (pgoff != FSDAX_INVALID_PGOFF)
+				tk->addr = vma_pgoff_address(pgoff, 1, vma);
+			tk->size_shift = dev_pagemap_mapping_shift(vma, tk->addr);
+		} else
+			tk->size_shift = page_shift(compound_head(p));
+	}
 
 	/*
 	 * Send SIGKILL if "tk->addr == -EFAULT". Also, as
@@ -666,8 +677,7 @@  static void collect_procs_file(struct page *page, struct list_head *to_kill,
 	i_mmap_unlock_read(mapping);
 }
 
-#ifdef CONFIG_FS_DAX
-static void add_to_kill_fsdax(struct task_struct *tsk, struct page *p,
+static void add_to_kill_pgoff(struct task_struct *tsk, struct page *p,
 			      struct vm_area_struct *vma,
 			      struct list_head *to_kill, pgoff_t pgoff)
 {
@@ -677,9 +687,9 @@  static void add_to_kill_fsdax(struct task_struct *tsk, struct page *p,
 /*
  * Collect processes when the error hit a fsdax page.
  */
-static void collect_procs_fsdax(struct page *page,
-		struct address_space *mapping, pgoff_t pgoff,
-		struct list_head *to_kill)
+static void collect_procs_pgoff(struct page *page,
+				struct address_space *mapping, pgoff_t pgoff,
+				struct list_head *to_kill)
 {
 	struct vm_area_struct *vma;
 	struct task_struct *tsk;
@@ -693,13 +703,12 @@  static void collect_procs_fsdax(struct page *page,
 			continue;
 		vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
 			if (vma->vm_mm == t->mm)
-				add_to_kill_fsdax(t, page, vma, to_kill, pgoff);
+				add_to_kill_pgoff(t, page, vma, to_kill, pgoff);
 		}
 	}
 	rcu_read_unlock();
 	i_mmap_unlock_read(mapping);
 }
-#endif /* CONFIG_FS_DAX */
 
 /*
  * Collect the processes who have the corrupted page mapped to kill.
@@ -893,6 +902,7 @@  static const char * const action_page_types[] = {
 	[MF_MSG_BUDDY]			= "free buddy page",
 	[MF_MSG_DAX]			= "dax page",
 	[MF_MSG_UNSPLIT_THP]		= "unsplit thp",
+	[MF_MSG_PFN_MAP]		= "non struct page pfn",
 	[MF_MSG_UNKNOWN]		= "unknown page",
 };
 
@@ -1324,7 +1334,8 @@  static int action_result(unsigned long pfn, enum mf_action_page_type type,
 
 	num_poisoned_pages_inc(pfn);
 
-	update_per_node_mf_stats(pfn, result);
+	if (type != MF_MSG_PFN_MAP)
+		update_per_node_mf_stats(pfn, result);
 
 	pr_err("%#lx: recovery action for %s: %s\n",
 		pfn, action_page_types[type], action_name[result]);
@@ -1805,7 +1816,7 @@  int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
 
 		SetPageHWPoison(page);
 
-		collect_procs_fsdax(page, mapping, index, &to_kill);
+		collect_procs_pgoff(page, mapping, index, &to_kill);
 		unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
 				index, mf_flags);
 unlock:
@@ -2144,6 +2155,83 @@  static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
 	return rc;
 }
 
+int register_pfn_address_space(struct pfn_address_space *pfn_space)
+{
+	if (!pfn_space)
+		return -EINVAL;
+
+	if (!request_mem_region(pfn_space->node.start << PAGE_SHIFT,
+		(pfn_space->node.last - pfn_space->node.start + 1) << PAGE_SHIFT, ""))
+		return -EBUSY;
+
+	mutex_lock(&pfn_space_lock);
+	interval_tree_insert(&pfn_space->node, &pfn_space_itree);
+	mutex_unlock(&pfn_space_lock);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(register_pfn_address_space);
+
+void unregister_pfn_address_space(struct pfn_address_space *pfn_space)
+{
+	if (!pfn_space)
+		return;
+
+	mutex_lock(&pfn_space_lock);
+	interval_tree_remove(&pfn_space->node, &pfn_space_itree);
+	mutex_unlock(&pfn_space_lock);
+	release_mem_region(pfn_space->node.start << PAGE_SHIFT,
+		(pfn_space->node.last - pfn_space->node.start + 1) << PAGE_SHIFT);
+}
+EXPORT_SYMBOL_GPL(unregister_pfn_address_space);
+
+static int memory_failure_pfn(unsigned long pfn, int flags)
+{
+	struct interval_tree_node *node;
+	int res = MF_FAILED;
+	LIST_HEAD(tokill);
+
+	mutex_lock(&pfn_space_lock);
+	/*
+	 * Modules registers with MM the address space mapping to the device memory they
+	 * manage. Iterate to identify exactly which address space has mapped to this
+	 * failing PFN.
+	 */
+	for (node = interval_tree_iter_first(&pfn_space_itree, pfn, pfn); node;
+	     node = interval_tree_iter_next(node, pfn, pfn)) {
+		struct pfn_address_space *pfn_space =
+			container_of(node, struct pfn_address_space, node);
+		/*
+		 * Modules managing the device memory need to be conveyed about the
+		 * memory failure so that the poisoned PFN can be tracked.
+		 */
+		if (pfn_space->ops)
+			pfn_space->ops->failure(pfn_space, pfn);
+
+		collect_procs_pgoff(NULL, pfn_space->mapping, pfn, &tokill);
+
+		unmap_mapping_range(pfn_space->mapping, pfn << PAGE_SHIFT,
+				    PAGE_SIZE, 0);
+
+		res = MF_RECOVERED;
+	}
+	mutex_unlock(&pfn_space_lock);
+
+	if (res == MF_FAILED)
+		return action_result(pfn, MF_MSG_PFN_MAP, res);
+
+	/*
+	 * Unlike System-RAM there is no possibility to swap in a different
+	 * physical page at a given virtual address, so all userspace
+	 * consumption of direct PFN memory necessitates SIGBUS (i.e.
+	 * MF_MUST_KILL)
+	 */
+	flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
+	kill_procs(&tokill, true, false, pfn, flags);
+
+	return action_result(pfn, MF_MSG_PFN_MAP, MF_RECOVERED);
+}
+
 /**
  * memory_failure - Handle memory failure of a page.
  * @pfn: Page Number of the corrupted page
@@ -2183,6 +2271,11 @@  int memory_failure(unsigned long pfn, int flags)
 	if (!(flags & MF_SW_SIMULATED))
 		hw_memory_failure = true;
 
+	if (!pfn_valid(pfn) && !arch_is_platform_page(PFN_PHYS(pfn))) {
+		res = memory_failure_pfn(pfn, flags);
+		goto unlock_mutex;
+	}
+
 	p = pfn_to_online_page(pfn);
 	if (!p) {
 		res = arch_memory_failure(pfn, flags);

[v1,1/4] mm: handle poisoning of pfn without struct pages

Commit Message

Comments

Patch