[07/11] mm, madvise_inject_error: fix page count leak

Message ID 152700000922.24093.14813242965473482705.stgit@dwillia2-desk3.amr.corp.intel.com
State New

Commit Message

Dan Williams May 22, 2018, 2:40 p.m. UTC
The madvise_inject_error() routine uses get_user_pages() to look up the
pfn and other information for the injected error, but it fails to release
that pin.

The dax-dma-vs-truncate warning catches this failure with the following
signature:

 Injecting memory failure for pfn 0x208900 at process virtual address 0x7f3908d00000
 Memory failure: 0x208900: reserved kernel page still referenced by 1 users
 Memory failure: 0x208900: recovery action for reserved kernel page: Failed
 WARNING: CPU: 37 PID: 9566 at fs/dax.c:348 dax_disassociate_entry+0x4e/0x90
 CPU: 37 PID: 9566 Comm: umount Tainted: G        W  OE     4.17.0-rc6+ #1900
 [..]
 RIP: 0010:dax_disassociate_entry+0x4e/0x90
 RSP: 0018:ffffc9000a9b3b30 EFLAGS: 00010002
 RAX: ffffea0008224000 RBX: 0000000000208a00 RCX: 0000000000208900
 RDX: 0000000000000001 RSI: ffff8804058c6160 RDI: 0000000000000008
 RBP: 000000000822000a R08: 0000000000000002 R09: 0000000000208800
 R10: 0000000000000000 R11: 0000000000208801 R12: ffff8804058c6168
 R13: 0000000000000000 R14: 0000000000000002 R15: 0000000000000001
 FS:  00007f4548027fc0(0000) GS:ffff880431d40000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 000056316d5f8988 CR3: 00000004298cc000 CR4: 00000000000406e0
 Call Trace:
  __dax_invalidate_mapping_entry+0xab/0xe0
  dax_delete_mapping_entry+0xf/0x20
  truncate_exceptional_pvec_entries.part.14+0x1d4/0x210
  truncate_inode_pages_range+0x291/0x920
  ? kmem_cache_free+0x1f8/0x300
  ? lock_acquire+0x9f/0x200
  ? truncate_inode_pages_final+0x31/0x50
  ext4_evict_inode+0x69/0x740

Cc: <stable@vger.kernel.org>
Fixes: bd1ce5f91f54 ("HWPOISON: avoid grabbing the page count...")
Cc: Michal Hocko <mhocko@suse.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 mm/madvise.c |   11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

Comments

Naoya Horiguchi May 23, 2018, 4:19 a.m. UTC | #1
On Tue, May 22, 2018 at 07:40:09AM -0700, Dan Williams wrote:
> The madvise_inject_error() routine uses get_user_pages() to look up the
> pfn and other information for the injected error, but it fails to release
> that pin.
> 
> The dax-dma-vs-truncate warning catches this failure with the following
> signature:
> 
>  Injecting memory failure for pfn 0x208900 at process virtual address 0x7f3908d00000
>  Memory failure: 0x208900: reserved kernel page still referenced by 1 users
>  Memory failure: 0x208900: recovery action for reserved kernel page: Failed
>  WARNING: CPU: 37 PID: 9566 at fs/dax.c:348 dax_disassociate_entry+0x4e/0x90
>  CPU: 37 PID: 9566 Comm: umount Tainted: G        W  OE     4.17.0-rc6+ #1900
>  [..]
>  RIP: 0010:dax_disassociate_entry+0x4e/0x90
>  RSP: 0018:ffffc9000a9b3b30 EFLAGS: 00010002
>  RAX: ffffea0008224000 RBX: 0000000000208a00 RCX: 0000000000208900
>  RDX: 0000000000000001 RSI: ffff8804058c6160 RDI: 0000000000000008
>  RBP: 000000000822000a R08: 0000000000000002 R09: 0000000000208800
>  R10: 0000000000000000 R11: 0000000000208801 R12: ffff8804058c6168
>  R13: 0000000000000000 R14: 0000000000000002 R15: 0000000000000001
>  FS:  00007f4548027fc0(0000) GS:ffff880431d40000(0000) knlGS:0000000000000000
>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  CR2: 000056316d5f8988 CR3: 00000004298cc000 CR4: 00000000000406e0
>  Call Trace:
>   __dax_invalidate_mapping_entry+0xab/0xe0
>   dax_delete_mapping_entry+0xf/0x20
>   truncate_exceptional_pvec_entries.part.14+0x1d4/0x210
>   truncate_inode_pages_range+0x291/0x920
>   ? kmem_cache_free+0x1f8/0x300
>   ? lock_acquire+0x9f/0x200
>   ? truncate_inode_pages_final+0x31/0x50
>   ext4_evict_inode+0x69/0x740
> 
> Cc: <stable@vger.kernel.org>
> Fixes: bd1ce5f91f54 ("HWPOISON: avoid grabbing the page count...")
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Andi Kleen <ak@linux.intel.com>
> Cc: Wu Fengguang <fengguang.wu@intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
>  mm/madvise.c |   11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 4d3c922ea1a1..246fa4d4eee2 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -631,11 +631,13 @@ static int madvise_inject_error(int behavior,
>  
>  
>  	for (; start < end; start += PAGE_SIZE << order) {
> +		unsigned long pfn;
>  		int ret;
>  
>  		ret = get_user_pages_fast(start, 1, 0, &page);
>  		if (ret != 1)
>  			return ret;
> +		pfn = page_to_pfn(page);
>  
>  		/*
>  		 * When soft offlining hugepages, after migrating the page
> @@ -651,17 +653,20 @@ static int madvise_inject_error(int behavior,
>  
>  		if (behavior == MADV_SOFT_OFFLINE) {
>  			pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
> -						page_to_pfn(page), start);
> +					pfn, start);
>  
>  			ret = soft_offline_page(page, MF_COUNT_INCREASED);
> +			put_page(page);
>  			if (ret)
>  				return ret;
>  			continue;
>  		}
> +		put_page(page);

We keep the page count pinned after the isolation of the error page
in order to make sure that the error page is disabled and never reused.
This is not explicit enough in the code, so a comment would be helpful.

BTW, looking at a kernel message like "Memory failure: 0x208900:
reserved kernel page still referenced by 1 users", memory_failure()
considers dev_pagemap pages to be "reserved kernel pages" (MF_MSG_KERNEL).
If the memory error handler recovers a dev_pagemap page in its own special
way, we can define a new action_page_types entry like MF_MSG_DAX.
Reporting something like "Memory failure: 0xXXXXX: recovery action for dax
page: Failed" might be helpful from an end user's perspective.

Thanks,
Naoya Horiguchi

> +
>  		pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
> -						page_to_pfn(page), start);
> +				pfn, start);
>  
> -		ret = memory_failure(page_to_pfn(page), MF_COUNT_INCREASED);
> +		ret = memory_failure(pfn, MF_COUNT_INCREASED);
>  		if (ret)
>  			return ret;
>  	}
>
Dan Williams May 24, 2018, 8:55 p.m. UTC | #2
On Tue, May 22, 2018 at 9:19 PM, Naoya Horiguchi
<n-horiguchi@ah.jp.nec.com> wrote:
> On Tue, May 22, 2018 at 07:40:09AM -0700, Dan Williams wrote:
>> The madvise_inject_error() routine uses get_user_pages() to look up the
>> pfn and other information for the injected error, but it fails to release
>> that pin.
>>
>> The dax-dma-vs-truncate warning catches this failure with the following
>> signature:
>>
>>  Injecting memory failure for pfn 0x208900 at process virtual address 0x7f3908d00000
>>  Memory failure: 0x208900: reserved kernel page still referenced by 1 users
>>  Memory failure: 0x208900: recovery action for reserved kernel page: Failed
>>  WARNING: CPU: 37 PID: 9566 at fs/dax.c:348 dax_disassociate_entry+0x4e/0x90
>>  CPU: 37 PID: 9566 Comm: umount Tainted: G        W  OE     4.17.0-rc6+ #1900
>>  [..]
>>  RIP: 0010:dax_disassociate_entry+0x4e/0x90
>>  RSP: 0018:ffffc9000a9b3b30 EFLAGS: 00010002
>>  RAX: ffffea0008224000 RBX: 0000000000208a00 RCX: 0000000000208900
>>  RDX: 0000000000000001 RSI: ffff8804058c6160 RDI: 0000000000000008
>>  RBP: 000000000822000a R08: 0000000000000002 R09: 0000000000208800
>>  R10: 0000000000000000 R11: 0000000000208801 R12: ffff8804058c6168
>>  R13: 0000000000000000 R14: 0000000000000002 R15: 0000000000000001
>>  FS:  00007f4548027fc0(0000) GS:ffff880431d40000(0000) knlGS:0000000000000000
>>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>  CR2: 000056316d5f8988 CR3: 00000004298cc000 CR4: 00000000000406e0
>>  Call Trace:
>>   __dax_invalidate_mapping_entry+0xab/0xe0
>>   dax_delete_mapping_entry+0xf/0x20
>>   truncate_exceptional_pvec_entries.part.14+0x1d4/0x210
>>   truncate_inode_pages_range+0x291/0x920
>>   ? kmem_cache_free+0x1f8/0x300
>>   ? lock_acquire+0x9f/0x200
>>   ? truncate_inode_pages_final+0x31/0x50
>>   ext4_evict_inode+0x69/0x740
>>
>> Cc: <stable@vger.kernel.org>
>> Fixes: bd1ce5f91f54 ("HWPOISON: avoid grabbing the page count...")
>> Cc: Michal Hocko <mhocko@suse.com>
>> Cc: Andi Kleen <ak@linux.intel.com>
>> Cc: Wu Fengguang <fengguang.wu@intel.com>
>> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
>> ---
>>  mm/madvise.c |   11 ++++++++---
>>  1 file changed, 8 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index 4d3c922ea1a1..246fa4d4eee2 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -631,11 +631,13 @@ static int madvise_inject_error(int behavior,
>>
>>
>>       for (; start < end; start += PAGE_SIZE << order) {
>> +             unsigned long pfn;
>>               int ret;
>>
>>               ret = get_user_pages_fast(start, 1, 0, &page);
>>               if (ret != 1)
>>                       return ret;
>> +             pfn = page_to_pfn(page);
>>
>>               /*
>>                * When soft offlining hugepages, after migrating the page
>> @@ -651,17 +653,20 @@ static int madvise_inject_error(int behavior,
>>
>>               if (behavior == MADV_SOFT_OFFLINE) {
>>                       pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
>> -                                             page_to_pfn(page), start);
>> +                                     pfn, start);
>>
>>                       ret = soft_offline_page(page, MF_COUNT_INCREASED);
>> +                     put_page(page);
>>                       if (ret)
>>                               return ret;
>>                       continue;
>>               }
>> +             put_page(page);
>
> We keep the page count pinned after the isolation of the error page
> in order to make sure that the error page is disabled and never reused.
> This is not explicit enough in the code, so a comment would be helpful.

As far as I can see, this extra reference count to keep the page from
being reused should be taken internally by memory_failure(), not assumed
from the inject error path. I might be overlooking something, but I do not
see who is responsible for taking this extra reference in the case
where memory_failure() is called by the machine check code rather than
madvise_inject_error()?

>
> BTW, looking at a kernel message like "Memory failure: 0x208900:
> reserved kernel page still referenced by 1 users", memory_failure()
> considers dev_pagemap pages to be "reserved kernel pages" (MF_MSG_KERNEL).
> If the memory error handler recovers a dev_pagemap page in its own special
> way, we can define a new action_page_types entry like MF_MSG_DAX.
> Reporting something like "Memory failure: 0xXXXXX: recovery action for dax
> page: Failed" might be helpful from an end user's perspective.

Sounds good, I'll take a look at this.

Patch

diff --git a/mm/madvise.c b/mm/madvise.c
index 4d3c922ea1a1..246fa4d4eee2 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -631,11 +631,13 @@  static int madvise_inject_error(int behavior,
 
 
 	for (; start < end; start += PAGE_SIZE << order) {
+		unsigned long pfn;
 		int ret;
 
 		ret = get_user_pages_fast(start, 1, 0, &page);
 		if (ret != 1)
 			return ret;
+		pfn = page_to_pfn(page);
 
 		/*
 		 * When soft offlining hugepages, after migrating the page
@@ -651,17 +653,20 @@  static int madvise_inject_error(int behavior,
 
 		if (behavior == MADV_SOFT_OFFLINE) {
 			pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
-						page_to_pfn(page), start);
+					pfn, start);
 
 			ret = soft_offline_page(page, MF_COUNT_INCREASED);
+			put_page(page);
 			if (ret)
 				return ret;
 			continue;
 		}
+		put_page(page);
+
 		pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
-						page_to_pfn(page), start);
+				pfn, start);
 
-		ret = memory_failure(page_to_pfn(page), MF_COUNT_INCREASED);
+		ret = memory_failure(pfn, MF_COUNT_INCREASED);
 		if (ret)
 			return ret;
 	}