
[3/7] mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED

Message ID 20201119105716.5962-4-osalvador@suse.de (mailing list archive)
State New, archived
Series HWPoison: Refactor get page interface

Commit Message

Oscar Salvador Nov. 19, 2020, 10:57 a.m. UTC
From: Naoya Horiguchi <naoya.horiguchi@nec.com>

The call to get_user_pages_fast() is only needed to get a pointer to the
struct page of a given address; pinning the page is the memory-poisoning
handler's job, so drop the refcount taken by get_user_pages_fast().

Note that the target page is still pinned after this put_page() because
the current process should still hold a reference through its mapping.

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/madvise.c | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

Comments

Vlastimil Babka Nov. 25, 2020, 6:20 p.m. UTC | #1
On 11/19/20 11:57 AM, Oscar Salvador wrote:
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> 
> The call to get_user_pages_fast is only to get the pointer to a struct
> page of a given address, pinning it is memory-poisoning handler's job,
> so drop the refcount grabbed by get_user_pages_fast().
> 
> Note that the target page is still pinned after this put_page() because
> the current process should have refcount from mapping.

Well, but can't it go away due to reclaim, migration or whatever?

> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> ---
>   mm/madvise.c | 19 +++++++++++--------
>   1 file changed, 11 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index c6b5524add58..7a0f64b93635 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -900,20 +900,23 @@ static int madvise_inject_error(int behavior,
>   		 */
>   		size = page_size(compound_head(page));
>   
> +		/*
> +		 * The get_user_pages_fast() is just to get the pfn of the
> +		 * given address, and the refcount has nothing to do with
> +		 * what we try to test, so it should be released immediately.
> +		 * This is racy but it's intended because the real hardware
> +		 * errors could happen at any moment and memory error handlers
> +		 * must properly handle the race.

Sure they have to. We might just be unexpectedly messing with other process' 
memory. Or does anything else prevent that?

> +		 */
> +		put_page(page);
> +
>   		if (behavior == MADV_SOFT_OFFLINE) {
>   			pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
>   				 pfn, start);
> -			ret = soft_offline_page(pfn, MF_COUNT_INCREASED);
> +			ret = soft_offline_page(pfn, 0);
>   		} else {
>   			pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
>   				 pfn, start);
> -			/*
> -			 * Drop the page reference taken by get_user_pages_fast(). In
> -			 * the absence of MF_COUNT_INCREASED the memory_failure()
> -			 * routine is responsible for pinning the page to prevent it
> -			 * from being released back to the page allocator.
> -			 */
> -			put_page(page);
>   			ret = memory_failure(pfn, 0);
>   		}
>   
>
Oscar Salvador Dec. 1, 2020, 11:35 a.m. UTC | #2
On Wed, Nov 25, 2020 at 07:20:33PM +0100, Vlastimil Babka wrote:
> On 11/19/20 11:57 AM, Oscar Salvador wrote:
> > From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > 
> > The call to get_user_pages_fast is only to get the pointer to a struct
> > page of a given address, pinning it is memory-poisoning handler's job,
> > so drop the refcount grabbed by get_user_pages_fast().
> > 
> > Note that the target page is still pinned after this put_page() because
> > the current process should have refcount from mapping.
> 
> Well, but can't it go away due to reclaim, migration or whatever?

Yes, it can.

> > @@ -900,20 +900,23 @@ static int madvise_inject_error(int behavior,
> >   		 */
> >   		size = page_size(compound_head(page));
> > +		/*
> > +		 * The get_user_pages_fast() is just to get the pfn of the
> > +		 * given address, and the refcount has nothing to do with
> > +		 * what we try to test, so it should be released immediately.
> > +		 * This is racy but it's intended because the real hardware
> > +		 * errors could happen at any moment and memory error handlers
> > +		 * must properly handle the race.
> 
> Sure they have to. We might just be unexpectedly messing with other process'
> memory. Or does anything else prevent that?

No, nothing does, and I have to confess that I managed to confuse myself here.
If we release the page and it ends up in the buddy allocator, nothing prevents
someone else from getting that page, and then we would be messing with another
process's memory.

I guess the right thing to do is to make sure we got that page and that it
remains pinned for as long as memory failure handling is in progress.
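A minimal user-space sketch of the race, assuming a toy model rather than kernel code: `struct toy_page` and the `toy_*` helpers are hypothetical stand-ins for a page frame, get_page()/put_page() and the page allocator, and the CPU0/CPU1 interleaving is replayed sequentially on one thread. The function returns the process whose frame the error handler would act on, with and without keeping the get_user_pages_fast() pin.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical toy model, NOT kernel code: a page frame with a refcount
 * and an owner. When the refcount reaches zero the frame is "freed" and
 * can be handed to the next allocation.
 */
struct toy_page {
	int refcount;
	int owner;		/* pid using the frame, 0 = free */
};

static void toy_get_page(struct toy_page *p)
{
	p->refcount++;
}

static void toy_put_page(struct toy_page *p)
{
	if (--p->refcount == 0)
		p->owner = 0;	/* back to the free list */
}

/* Allocation only succeeds if nobody holds a reference to the frame. */
static bool toy_alloc(struct toy_page *p, int pid)
{
	if (p->refcount != 0)
		return false;
	p->refcount = 1;
	p->owner = pid;
	return true;
}

/*
 * Replays the interleaving and returns the pid whose frame the error
 * handler hits. keep_pin = false models dropping the GUP reference before
 * memory_failure(); keep_pin = true models holding it until the handler
 * is done.
 */
static int toy_inject_error(bool keep_pin)
{
	const int X = 1, Y = 2;
	struct toy_page page = { .refcount = 1, .owner = X }; /* mapped by X */

	toy_get_page(&page);		/* CPU0: get_user_pages_fast() in X */
	if (!keep_pin)
		toy_put_page(&page);	/* CPU0: early put_page() */

	toy_put_page(&page);		/* CPU1: reclaim drops X's mapping ref */
	toy_alloc(&page, Y);		/* CPU1: Y tries to allocate the frame */

	int victim = page.owner;	/* CPU0: memory_failure() on the pfn */

	if (keep_pin)
		toy_put_page(&page);	/* handler done, release the pin */
	return victim;
}
```

With the pin dropped early, the frame is freed and reallocated to Y before the handler runs; with the pin kept, Y's allocation fails and the handler still acts on the page X asked about.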

I will remove those patches from the patchset and re-submit with only the
refactoring and pcp-disabling.

Thanks Vlastimil
Vlastimil Babka Dec. 4, 2020, 5:25 p.m. UTC | #3
On 12/1/20 12:35 PM, Oscar Salvador wrote:
> On Wed, Nov 25, 2020 at 07:20:33PM +0100, Vlastimil Babka wrote:
>> On 11/19/20 11:57 AM, Oscar Salvador wrote:
>> > From: Naoya Horiguchi <naoya.horiguchi@nec.com>
>> > 
>> > The call to get_user_pages_fast is only to get the pointer to a struct
>> > page of a given address, pinning it is memory-poisoning handler's job,
>> > so drop the refcount grabbed by get_user_pages_fast().
>> > 
>> > Note that the target page is still pinned after this put_page() because
>> > the current process should have refcount from mapping.
>> 
>> Well, but can't it go away due to reclaim, migration or whatever?
> 
> Yes, it can.
> 
>> > @@ -900,20 +900,23 @@ static int madvise_inject_error(int behavior,
>> >   		 */
>> >   		size = page_size(compound_head(page));
>> > +		/*
>> > +		 * The get_user_pages_fast() is just to get the pfn of the
>> > +		 * given address, and the refcount has nothing to do with
>> > +		 * what we try to test, so it should be released immediately.
>> > +		 * This is racy but it's intended because the real hardware
>> > +		 * errors could happen at any moment and memory error handlers
>> > +		 * must properly handle the race.
>> 
>> Sure they have to. We might just be unexpectedly messing with other process'
>> memory. Or does anything else prevent that?
> 
> No, nothing does, and I have to confess that I managed to confuse myself here.
> If we release such page and that page ends up in buddy, nothing prevents someone
> else to get that page, and then we would be messing with other process memory.
> 
> I guess the right thing to do is just to make sure we got that page and that
> that page remains pinned as long as the memory failure handling goes.

OK, so that means we don't introduce this race for MADV_SOFT_OFFLINE, but it's
already (and still) there for MADV_HWPOISON since Dan's 23e7b5c2e271 ("mm,
madvise_inject_error: Let memory_failure() optionally take a page reference") no?

> I will remove those patches from the patchset and re-submit with only the
> refactoring and pcp-disabling.
> 
> Thanks Vlastimil
>
Oscar Salvador Dec. 5, 2020, 3:34 p.m. UTC | #4
On Fri, Dec 04, 2020 at 06:25:31PM +0100, Vlastimil Babka wrote:
> OK, so that means we don't introduce this race for MADV_SOFT_OFFLINE, but it's
> already (and still) there for MADV_HWPOISON since Dan's 23e7b5c2e271 ("mm,
> madvise_inject_error: Let memory_failure() optionally take a page reference") no?

What about the following?
CCing Dan as well.

From: Oscar Salvador <osalvador@suse.de>
Date: Sat, 5 Dec 2020 16:14:40 +0100
Subject: [PATCH] mm,memory_failure: Always pin the page in
 madvise_inject_error

madvise_inject_error() uses get_user_pages_fast() to get the page
backing the address we specified.
Since [1], we drop that extra reference on the memory_failure() path.
That commit says that memory_failure() is responsible for pinning the
page in order to take it out of circulation.

However, we need to keep the page pinned ourselves; otherwise the
page might be reused after the put_page(), and we could end up messing
with someone else's memory.
For example:

CPU0 (process X)			CPU1
 madvise_inject_error
  get_user_pages
   put_page
					page gets reclaimed
					process Y allocates the page
  memory_failure
   // We mess with process Y memory

madvise() is meant to operate on the caller's own address space, so
messing with pages that do not belong to us seems wrong.
To avoid that, let us keep the page pinned for memory_failure() as well.

Pages for DAX mappings will release this extra refcount in
memory_failure_dev_pagemap.
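The reference-ownership rule this patch relies on can be sketched as follows. This is a hypothetical model, not the kernel implementation: `TOY_MF_COUNT_INCREASED` plays the role of MF_COUNT_INCREASED (the caller already holds a reference and hands it over), the regular path keeps that reference to hold the poisoned page out of circulation, and the dev_pagemap (DAX) path drops it because, as noted later in the thread, it does not rely on the page refcount.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag mirroring the role of MF_COUNT_INCREASED. */
#define TOY_MF_COUNT_INCREASED 0x1

struct toy_page {
	int refcount;
	bool devmap;		/* true for a DAX / dev_pagemap page */
	bool poisoned;
};

/* dev_pagemap path: the refcount is not used, so drop the caller's pin. */
static int toy_failure_dev_pagemap(struct toy_page *p, int flags)
{
	if (flags & TOY_MF_COUNT_INCREASED)
		p->refcount--;	/* put_page() of the madvise() reference */
	p->poisoned = true;
	return 0;
}

/* Regular path: a pin is what keeps the page away from the allocator. */
static int toy_memory_failure(struct toy_page *p, int flags)
{
	if (p->devmap)
		return toy_failure_dev_pagemap(p, flags);
	if (!(flags & TOY_MF_COUNT_INCREASED))
		p->refcount++;	/* no reference handed over: take our own */
	p->poisoned = true;	/* the pin is intentionally kept */
	return 0;
}
```

Either way, exactly one reference ends up owned by the handler on the regular path, and none on the dev_pagemap path.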

[1] 23e7b5c2e271 ("mm, madvise_inject_error: Let memory_failure() optionally take a page reference")

Signed-off-by: Oscar Salvador <osalvador@suse.de>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Fixes: 23e7b5c2e271 ("mm, madvise_inject_error: Let memory_failure() optionally take a page reference")
---
 mm/madvise.c        | 9 +--------
 mm/memory-failure.c | 6 ++++++
 2 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index c6b5524add58..19edddba196d 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -907,14 +907,7 @@ static int madvise_inject_error(int behavior,
 		} else {
 			pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
 				 pfn, start);
-			/*
-			 * Drop the page reference taken by get_user_pages_fast(). In
-			 * the absence of MF_COUNT_INCREASED the memory_failure()
-			 * routine is responsible for pinning the page to prevent it
-			 * from being released back to the page allocator.
-			 */
-			put_page(page);
-			ret = memory_failure(pfn, 0);
+			ret = memory_failure(pfn, MF_COUNT_INCREASED);
 		}
 
 		if (ret)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 869ece2a1de2..ba861169c9ae 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1269,6 +1269,12 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
 	if (!cookie)
 		goto out;
 
+	if (flags & MF_COUNT_INCREASED)
+		/*
+		 * Drop the extra refcount in case we come from madvise().
+		 */
+		put_page(page);
+
 	if (hwpoison_filter(page)) {
 		rc = 0;
 		goto unlock;
HORIGUCHI NAOYA(堀口 直也) Dec. 7, 2020, 2:34 a.m. UTC | #5
On Sat, Dec 05, 2020 at 04:34:23PM +0100, Oscar Salvador wrote:
> On Fri, Dec 04, 2020 at 06:25:31PM +0100, Vlastimil Babka wrote:
> > OK, so that means we don't introduce this race for MADV_SOFT_OFFLINE, but it's
> > already (and still) there for MADV_HWPOISON since Dan's 23e7b5c2e271 ("mm,
> > madvise_inject_error: Let memory_failure() optionally take a page reference") no?
> 
> What about the following?
> CCing Dan as well.

Hi Oscar, Vlastimil,

Thanks for mentioning this. I agree with that direction.

> 
> From: Oscar Salvador <osalvador@suse.de>
> Date: Sat, 5 Dec 2020 16:14:40 +0100
> Subject: [PATCH] mm,memory_failure: Always pin the page in
>  madvise_inject_error
> 
> madvise_inject_error() uses get_user_pages_fast to get the page
> from the addr we specified.
> After [1], we drop such extra reference for memory_failure() path.
> That commit says that memory_failure wanted to keep the pin in order
> to take the page out of circulation.
> 
> The truth is that we need to keep the page pinned, otherwise the
> page might be re-used after the put_page(), and we can end up messing
> with someone else's memory.
> E.g:
> 
> CPU0
> process X					CPU1
>  madvise_inject_error
>   get_user_pages
>    put_page
> 					page gets reclaimed
> 					process Y allocates the page
>   memory_failure
>    // We mess with process Y memory
> 
> madvise() is meant to operate on a self address space, so messing with
> pages that do not belong to us seems the wrong thing to do.
> To avoid that, let us keep the page pinned for memory_failure as well.
> 
> Pages for DAX mappings will release this extra refcount in
> memory_failure_dev_pagemap.
> 
> [1] ("23e7b5c2e271: mm, madvise_inject_error:
>       Let memory_failure() optionally take a page reference")
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> Suggested-by: Vlastimil Babka <vbabka@suse.cz>
> Fixes: 23e7b5c2e271 ("mm, madvise_inject_error: Let memory_failure() optionally take a page reference")
> ---
>  mm/madvise.c        | 9 +--------
>  mm/memory-failure.c | 6 ++++++
>  2 files changed, 7 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index c6b5524add58..19edddba196d 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -907,14 +907,7 @@ static int madvise_inject_error(int behavior,
>  		} else {
>  			pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
>  				 pfn, start);
> -			/*
> -			 * Drop the page reference taken by get_user_pages_fast(). In
> -			 * the absence of MF_COUNT_INCREASED the memory_failure()
> -			 * routine is responsible for pinning the page to prevent it
> -			 * from being released back to the page allocator.
> -			 */
> -			put_page(page);
> -			ret = memory_failure(pfn, 0);
> +			ret = memory_failure(pfn, MF_COUNT_INCREASED);
>  		}
>  
>  		if (ret)
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 869ece2a1de2..ba861169c9ae 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1269,6 +1269,12 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
>  	if (!cookie)
>  		goto out;
>  
> +	if (flags & MF_COUNT_INCREASED)
> +		/*
> +		 * Drop the extra refcount in case we come from madvise().
> +		 */
> +		put_page(page);
> +

Should this if-block come before dax_lock_page() block?
It seems that if dax_lock_page() returns NULL, memory_failure_dev_pagemap()
returns without releasing the refcount.
memory_failure() on dev_pagemap doesn't use the page refcount (unlike other
types of memory), so we can release it unconditionally.
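The leak described here comes from an early return placed before the put_page(). A hypothetical sketch, assuming `toy_dax_lock()` stands in for dax_lock_page() and ignoring the rest of the real error path, compares the ordering as posted with the ordering suggested in review.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag mirroring the role of MF_COUNT_INCREASED. */
#define TOY_MF_COUNT_INCREASED 0x1

struct toy_page {
	int refcount;
};

/* Hypothetical stand-in for dax_lock_page(): false means "no cookie". */
static bool toy_dax_lock(bool lock_ok)
{
	return lock_ok;
}

/* Ordering as first posted: the put happens after the lock check. */
static int toy_dev_pagemap_v1(struct toy_page *p, int flags, bool lock_ok)
{
	if (!toy_dax_lock(lock_ok))
		return -1;	/* early exit: the caller's pin is leaked */
	if (flags & TOY_MF_COUNT_INCREASED)
		p->refcount--;
	return 0;
}

/* Suggested ordering: drop the pin first, before any early exit. */
static int toy_dev_pagemap_v2(struct toy_page *p, int flags, bool lock_ok)
{
	if (flags & TOY_MF_COUNT_INCREASED)
		p->refcount--;
	if (!toy_dax_lock(lock_ok))
		return -1;
	return 0;
}
```

Because the dev_pagemap path does not depend on the refcount, releasing it unconditionally at the top is safe and removes the leak on every exit path.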

Thanks,
Naoya Horiguchi
Oscar Salvador Dec. 7, 2020, 7:24 a.m. UTC | #6
On 2020-12-07 03:34, HORIGUCHI NAOYA wrote:
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index 869ece2a1de2..ba861169c9ae 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -1269,6 +1269,12 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
>>  	if (!cookie)
>>  		goto out;
>> 
>> +	if (flags & MF_COUNT_INCREASED)
>> +		/*
>> +		 * Drop the extra refcount in case we come from madvise().
>> +		 */
>> +		put_page(page);
>> +
> 
> Should this if-block come before dax_lock_page() block?

Yeah, it should go first, since, as you noticed, we keep the refcount
if dax_lock_page() fails.
Saturday brain... I will fix it.

Thanks Naoya

Patch

diff --git a/mm/madvise.c b/mm/madvise.c
index c6b5524add58..7a0f64b93635 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -900,20 +900,23 @@  static int madvise_inject_error(int behavior,
 		 */
 		size = page_size(compound_head(page));
 
+		/*
+		 * The get_user_pages_fast() is just to get the pfn of the
+		 * given address, and the refcount has nothing to do with
+		 * what we try to test, so it should be released immediately.
+		 * This is racy but it's intended because the real hardware
+		 * errors could happen at any moment and memory error handlers
+		 * must properly handle the race.
+		 */
+		put_page(page);
+
 		if (behavior == MADV_SOFT_OFFLINE) {
 			pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
 				 pfn, start);
-			ret = soft_offline_page(pfn, MF_COUNT_INCREASED);
+			ret = soft_offline_page(pfn, 0);
 		} else {
 			pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
 				 pfn, start);
-			/*
-			 * Drop the page reference taken by get_user_pages_fast(). In
-			 * the absence of MF_COUNT_INCREASED the memory_failure()
-			 * routine is responsible for pinning the page to prevent it
-			 * from being released back to the page allocator.
-			 */
-			put_page(page);
 			ret = memory_failure(pfn, 0);
 		}